How can I lazily read multiple JSON values from a file/stream in Python?

Question

I'd like to read multiple JSON objects from a file/stream in Python, one at a time. Unfortunately, json.load() just .read()s until end-of-file; there doesn't seem to be any way to use it to read a single object or to lazily iterate over the objects.

Is there any way to do this? Using the standard library would be ideal, but if there's a third-party library I'd use that instead.

At the moment I'm putting each object on a separate line and using json.loads(f.readline()), but I would really prefer not to need to do this.

Example use

example.py

    import my_json as json
    import sys

    for o in json.iterload(sys.stdin):
        print("Working on a", type(o))

in.txt

{"foo": ["bar", "baz"]} 1 2 [] 4 5 6

example session

    $ python3.2 example.py < in.txt
    Working on a dict
    Working on a int
    Working on a int
    Working on a list
    Working on a int
    Working on a int
    Working on a int

Nic Watson · Accepted Answer

Here's a much, much simpler solution. The secret is to try, fail, and use the information in the exception to parse correctly. The only limitation is that the file must be seekable.

    def stream_read_json(fn):
        import json
        start_pos = 0
        with open(fn, 'r') as f:
            while True:
                try:
                    # Succeeds only when a single value remains in the file.
                    obj = json.load(f)
                    yield obj
                    return
                except json.JSONDecodeError as e:
                    # e.pos is where the extra data begins, relative to start_pos.
                    f.seek(start_pos)
                    json_str = f.read(e.pos)
                    obj = json.loads(json_str)
                    start_pos += e.pos
                    yield obj

Edit: I just noticed that this only works for Python >= 3.5. On earlier versions, failures raise a ValueError, and you have to parse the position out of the message string, e.g.

    def stream_read_json(fn):
        import json
        import re
        start_pos = 0
        with open(fn, 'r') as f:
            while True:
                try:
                    obj = json.load(f)
                    yield obj
                    return
                except ValueError as e:
                    f.seek(start_pos)
                    # The message looks like
                    # "Extra data: line 1 column 25 - line 1 column 30 (char 24 - 29)".
                    end_pos = int(re.match(r'Extra data: line \d+ column \d+ .*\(char (\d+).*\)',
                                           e.args[0]).groups()[0])
                    json_str = f.read(end_pos)
                    obj = json.loads(json_str)
                    start_pos += end_pos
                    yield obj
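To make the mechanism concrete, here is a minimal sketch (Python 3.5+) of what the exception carries when a string holds more than one value. The sample text is made up for illustration; `pos` is the documented `JSONDecodeError` attribute giving the index at which parsing failed.

```python
import json

# Hypothetical input: one object followed by two more values.
text = '{"foo": ["bar", "baz"]} 1 2'

try:
    json.loads(text)
except json.JSONDecodeError as exc:
    # For an "Extra data" error, exc.pos is the index where the leftover
    # data starts, so everything before it is one complete JSON document.
    first = json.loads(text[:exc.pos])
    rest = text[exc.pos:]

print(first)  # the first value, decoded on its own
print(rest)   # what still has to be parsed
```

Repeating the same step on `rest` is what the generator above does, with file seeks instead of string slices.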

Thomas K · Answer

JSON generally isn't very good for this sort of incremental use; there's no standard way to serialise multiple objects so that they can easily be loaded one at a time, without parsing the whole lot.

The object-per-line solution that you're using is seen elsewhere too. Scrapy calls it "JSON lines".

You can do it slightly more Pythonically:

    for jsonline in f:
        yield json.loads(jsonline)   # or do the processing in this loop

I think this is about the best way: it doesn't rely on any third-party libraries, and it's easy to understand what's going on. I've used it in some of my own code as well.
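As an illustration of why one value per line is safe, here is a small round-trip sketch. The helper names `dump_lines` and `iter_lines` are made up for this example; it relies only on the fact that `json.dumps` escapes newlines inside strings, so each serialised value stays on a single line.

```python
import io
import json

def dump_lines(objs, fp):
    """Write one JSON value per line (hypothetical helper)."""
    for obj in objs:
        fp.write(json.dumps(obj) + "\n")

def iter_lines(fp):
    """Lazily yield one decoded value per non-blank line (hypothetical helper)."""
    for line in fp:
        if line.strip():
            yield json.loads(line)

records = [{"foo": ["bar", "baz"]}, 1, 2, [], "multi\nline text"]
buf = io.StringIO()
dump_lines(records, buf)
buf.seek(0)
loaded = list(iter_lines(buf))
```

Only one line is held in memory at a time, which is the whole appeal of the format.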

Krumelur · Answer

A little late maybe, but I had this exact problem (well, more or less). My standard solution for these problems is usually to do a regex split on some well-known root object, but in my case that was impossible. The only feasible way to do this generically is to implement a proper tokenizer.

After not finding a generic-enough and reasonably well-performing solution, I ended up doing this myself, writing the splitstream module. It is a pre-tokenizer that understands JSON and XML and splits a continuous stream into multiple chunks for parsing (it leaves the actual parsing up to you, though). To get some kind of performance out of it, it is written as a C module.

Example:

    import json
    import sys

    from splitstream import splitfile

    def iterload(fp=sys.stdin):
        # splitfile yields one complete JSON document at a time
        for jsonstr in splitfile(fp, format="json"):
            yield json.loads(jsonstr)

Benedict · Answer

This is actually a pretty nasty problem, because you have to stream in lines, pattern-match across multiple lines against braces, and also pattern-match JSON. It's a sort of JSON pre-parse followed by a JSON parse. JSON is, in comparison to other formats, easy to parse, so it's not always necessary to reach for a parsing library. Nevertheless, how should we solve these conflicting issues?

Generators to the rescue!

The beauty of generators for a problem like this is that you can stack them on top of each other, gradually abstracting away the difficulty of the problem while maintaining laziness. I also considered using the mechanism for passing values back into a generator (send()), but fortunately found I didn't need it.

To solve the first of the problems you need some sort of streamingfinditer, a streaming version of re.finditer. My attempt below pulls in lines as needed (uncomment the debug statement to see) while still returning matches. I then modified it slightly to yield non-matched lines as well as matches (marked as 0 or 1 in the first part of the yielded tuple).

    import re

    def streamingfinditer(pat, stream):
        for s in stream:
            # print("Read next line: " + s)
            while 1:
                m = re.search(pat, s)
                if not m:
                    yield (0, s)
                    break
                yield (1, m.group())
                s = re.split(pat, s, 1)[1]

With that, it's then possible to match up to braces, track each time whether the braces are balanced, and then return either simple or compound objects as appropriate.

    braces = '{}[]'
    whitespaceesc = ' \t'
    bracesesc = '\\' + '\\'.join(braces)
    balancemap = dict(zip(braces, [1, -1, 1, -1]))
    bracespat = '[' + bracesesc + ']'
    nobracespat = '[^' + bracesesc + ']*'
    untilbracespat = nobracespat + bracespat

    def simpleorcompoundobjects(stream):
        obj = ""
        unbalanced = 0
        for (c, m) in streamingfinditer(re.compile(untilbracespat), stream):
            if (c == 0):  # remainder of line returned, nothing interesting
                if (unbalanced == 0):
                    yield (0, m)
                else:
                    obj += m
            if (c == 1):  # match returned
                if (unbalanced == 0):
                    yield (0, m[:-1])
                    obj += m[-1]
                else:
                    obj += m
                unbalanced += balancemap[m[-1]]
                if (unbalanced == 0):
                    yield (1, obj)
                    obj = ""

This returns tuples as follows:

    (0,"String of simple non-braced objects easy to parse")
    (1,"{ 'Compound' : 'objects' }")

Basically, that's the nasty part done. We now just have to do the final level of parsing as we see fit. For example, we can use Jeremy Roman's iterload function (thanks!) to parse a single line:

    def streamingiterload(stream):
        for c, o in simpleorcompoundobjects(stream):
            for x in iterload(o):
                yield x

Test it:

    of = open("test.json", "w")
    of.write("""[ "hello" ] { "goodbye" : 1 } 1 2 {
    } 2
    9 78
     4 5 { "animals" : [ "dog" , "lots of mice" ,
     "cat" ] }
    """)
    of.close()
    # open & stream the json
    f = open("test.json", "r")
    for o in streamingiterload(f.readlines()):
        print(o)
    f.close()

I get these results under Python 2 (and if you turn on that debug line, you'll see it pulls in the lines as needed):

    [u'hello']
    {u'goodbye': 1}
    1
    2
    {}
    2
    9
    78
    4
    5
    {u'animals': [u'dog', u'lots of mice', u'cat']}

This won't work for all situations. Due to the implementation of the json library, it is impossible to work entirely correctly without reimplementing the parser yourself.
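One class of input where brace counting alone goes wrong is a brace inside a string literal. The sketch below is my own illustration, not necessarily the limitation the answer had in mind: `naive_split` is a hypothetical stand-in for any splitter that balances braces without tracking JSON strings.

```python
import json

def naive_split(text):
    """Split on balanced braces/brackets, ignoring JSON string rules."""
    depth, start, chunks = 0, None, []
    for i, ch in enumerate(text):
        if ch in "{[":
            if depth == 0:
                start = i
            depth += 1
        elif ch in "}]":
            depth -= 1
            if depth == 0:
                chunks.append(text[start:i + 1])
    return chunks

tricky = '{"a": "}"}'

# The brace inside the string closes the "object" too early.
chunks = naive_split(tricky)

# A real parser knows it is inside a string and is not fooled.
value, end = json.JSONDecoder().raw_decode(tricky)
```

Here `chunks` holds a truncated fragment that `json.loads` would reject, while `raw_decode` consumes the whole text.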

Jeremy Roman · Answer

Of course you can do this. You just have to call raw_decode directly. This implementation loads the whole file into memory and operates on that string (much as json.load does); if you have large files, you can modify it to read from the file only as necessary without much difficulty.

    import json
    from json.decoder import WHITESPACE

    def iterload(string_or_fp, cls=json.JSONDecoder, **kwargs):
        # Accept either a file-like object or a string.
        if hasattr(string_or_fp, "read"):
            string = string_or_fp.read()
        else:
            string = str(string_or_fp)

        decoder = cls(**kwargs)
        # raw_decode does not skip leading whitespace, so do it explicitly.
        idx = WHITESPACE.match(string, 0).end()
        while idx < len(string):
            obj, end = decoder.raw_decode(string, idx)
            yield obj
            idx = WHITESPACE.match(string, end).end()

Usage: just as you requested, it's a generator.
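The "read only as necessary" variant could look something like the sketch below. This is an assumption about one way to do it, not the answerer's code: `iterload_chunked` and `chunk_size` are names invented here. It keeps a buffer, tries `raw_decode`, and reads another chunk whenever decoding fails or the decoded value touches the end of the buffer (a number such as `12` might otherwise be cut into `1` and `2`).

```python
import io
import json

def iterload_chunked(fp, chunk_size=4096):
    """Yield JSON values from a text stream, reading chunk_size chars at a time."""
    decoder = json.JSONDecoder()
    buf = ""
    eof = False
    while True:
        buf = buf.lstrip(" \t\r\n")  # raw_decode rejects leading whitespace
        try:
            if not buf:
                raise ValueError("empty buffer")
            obj, end = decoder.raw_decode(buf)
            if end == len(buf) and not eof:
                # The value might continue in the next chunk.
                raise ValueError("value may be incomplete")
        except ValueError:  # JSONDecodeError is a subclass of ValueError
            if eof:
                if buf:
                    raise  # leftover data that is not valid JSON
                return
            chunk = fp.read(chunk_size)
            if chunk:
                buf += chunk
            else:
                eof = True
            continue
        yield obj
        buf = buf[end:]

stream = io.StringIO('{"foo": ["bar", "baz"]} 1 2 [] 4 5 6')
values = list(iterload_chunked(stream, chunk_size=3))
```

The trade-off is that a value larger than one chunk is re-parsed from its start each time more data arrives, so a tiny `chunk_size` is slow on big documents, and malformed input is only reported once the stream has been read to the end.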

Tarun Lalwani · Answer

I believe a better way of doing this would be to use a state machine. Below is sample code that I worked out by converting the NodeJS code at the link below to Python.

Edit-1: Updated the code and made it compatible with Python 2

Edit-2: Updated and added a Python 3 only version as well

https://gist.github.com/creationix/5992451

Python 3 only version

    # A streaming byte oriented JSON parser.  Feed it a single byte at a time and
    # it will emit complete objects as it comes across them.  Whitespace within and
    # between objects is ignored.  This means it can parse newline delimited JSON.
    import math


    def json_machine(emit, next_func=None):
        def _value(byte_data):
            if not byte_data:
                return

            if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
                return _value  # Ignore whitespace

            if byte_data == 0x22:  # "
                return string_machine(on_value)

            if byte_data == 0x2d or (0x30 <= byte_data < 0x40):  # - or 0-9
                return number_machine(byte_data, on_number)

            if byte_data == 0x7b:  # {
                return object_machine(on_value)

            if byte_data == 0x5b:  # [
                return array_machine(on_value)

            if byte_data == 0x74:  # t
                return constant_machine(TRUE, True, on_value)

            if byte_data == 0x66:  # f
                return constant_machine(FALSE, False, on_value)

            if byte_data == 0x6e:  # n
                return constant_machine(NULL, None, on_value)

            if next_func == _value:
                raise Exception("Unexpected 0x%02x" % byte_data)

            return next_func(byte_data)

        def on_value(value):
            emit(value)
            return next_func

        def on_number(number, byte):
            emit(number)
            return _value(byte)

        next_func = next_func or _value
        return _value


    TRUE = [0x72, 0x75, 0x65]
    FALSE = [0x61, 0x6c, 0x73, 0x65]
    NULL = [0x75, 0x6c, 0x6c]


    def constant_machine(bytes_data, value, emit):
        i = 0
        length = len(bytes_data)

        def _constant(byte_data):
            nonlocal i
            if byte_data != bytes_data[i]:
                i += 1
                raise Exception("Unexpected 0x%02x" % byte_data)

            i += 1
            if i < length:
                return _constant
            return emit(value)

        return _constant


    def string_machine(emit):
        string = ""

        def _string(byte_data):
            nonlocal string

            if byte_data == 0x22:  # "
                return emit(string)

            if byte_data == 0x5c:  # backslash
                return _escaped_string

            if byte_data & 0x80:  # UTF-8 handling
                return utf8_machine(byte_data, on_char_code)

            if byte_data < 0x20:  # ASCII control character
                raise Exception("Unexpected control character: 0x%02x" % byte_data)

            string += chr(byte_data)
            return _string

        def _escaped_string(byte_data):
            nonlocal string

            if byte_data == 0x22 or byte_data == 0x5c or byte_data == 0x2f:  # " \ /
                string += chr(byte_data)
                return _string

            if byte_data == 0x62:  # b
                string += "\b"
                return _string

            if byte_data == 0x66:  # f
                string += "\f"
                return _string

            if byte_data == 0x6e:  # n
                string += "\n"
                return _string

            if byte_data == 0x72:  # r
                string += "\r"
                return _string

            if byte_data == 0x74:  # t
                string += "\t"
                return _string

            if byte_data == 0x75:  # u
                return hex_machine(on_char_code)

        def on_char_code(char_code):
            nonlocal string
            string += chr(char_code)
            return _string

        return _string


    # Nestable state machine for UTF-8 Decoding.
    def utf8_machine(byte_data, emit):
        left = 0
        num = 0

        def _utf8(byte_data):
            nonlocal num, left
            if (byte_data & 0xc0) != 0x80:
                raise Exception("Invalid byte in UTF-8 character: 0x%02x" % byte_data)

            left = left - 1

            num |= (byte_data & 0x3f) << (left * 6)
            if left:
                return _utf8
            return emit(num)

        if 0xc0 <= byte_data < 0xe0:  # 2-byte UTF-8 Character
            left = 1
            num = (byte_data & 0x1f) << 6
            return _utf8

        if 0xe0 <= byte_data < 0xf0:  # 3-byte UTF-8 Character
            left = 2
            num = (byte_data & 0xf) << 12
            return _utf8

        if 0xf0 <= byte_data < 0xf8:  # 4-byte UTF-8 Character
            left = 3
            num = (byte_data & 0x07) << 18
            return _utf8

        raise Exception("Invalid byte in UTF-8 string: 0x%02x" % byte_data)


    # Nestable state machine for hex escaped characters
    def hex_machine(emit):
        left = 4
        num = 0

        def _hex(byte_data):
            nonlocal num, left

            if 0x30 <= byte_data < 0x40:
                i = byte_data - 0x30
            elif 0x61 <= byte_data <= 0x66:
                i = byte_data - 0x57
            elif 0x41 <= byte_data <= 0x46:
                i = byte_data - 0x37
            else:
                raise Exception("Expected hex char in string hex escape")

            left -= 1
            num |= i << (left * 4)

            if left:
                return _hex
            return emit(num)

        return _hex


    def number_machine(byte_data, emit):
        sign = 1
        number = 0
        decimal = 0
        divisor = 1  # power of ten for the next fractional digit
        esign = 1
        exponent = 0

        def _mid(byte_data):
            if byte_data == 0x2e:  # .
                return _decimal

            return _later(byte_data)

        def _number(byte_data):
            nonlocal number
            if 0x30 <= byte_data < 0x40:
                number = number * 10 + (byte_data - 0x30)
                return _number

            return _mid(byte_data)

        def _start(byte_data):
            if byte_data == 0x30:
                return _mid

            if 0x30 < byte_data < 0x40:
                return _number(byte_data)

            raise Exception("Invalid number: 0x%02x" % byte_data)

        def _decimal(byte_data):
            nonlocal decimal, divisor
            if 0x30 <= byte_data < 0x40:
                # Each fractional digit is worth a tenth of the previous one.
                divisor *= 10
                decimal += (byte_data - 0x30) / divisor
                return _decimal

            return _later(byte_data)

        def _later(byte_data):
            if byte_data == 0x45 or byte_data == 0x65:  # E e
                return _esign

            return _done(byte_data)

        def _esign(byte_data):
            nonlocal esign
            if byte_data == 0x2b:  # +
                return _exponent

            if byte_data == 0x2d:  # -
                esign = -1
                return _exponent

            return _exponent(byte_data)

        def _exponent(byte_data):
            nonlocal exponent
            if 0x30 <= byte_data < 0x40:
                exponent = exponent * 10 + (byte_data - 0x30)
                return _exponent

            return _done(byte_data)

        def _done(byte_data):
            value = sign * (number + decimal)
            if exponent:
                value *= math.pow(10, esign * exponent)

            return emit(value, byte_data)

        # The sign check must come after all nested states are defined;
        # unlike JavaScript, Python does not hoist function definitions.
        if byte_data == 0x2d:  # -
            sign = -1
            return _start

        return _start(byte_data)


    def array_machine(emit):
        array_data = []

        def _array(byte_data):
            if byte_data == 0x5d:  # ]
                return emit(array_data)

            return json_machine(on_value, _comma)(byte_data)

        def on_value(value):
            array_data.append(value)

        def _comma(byte_data):
            if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
                return _comma  # Ignore whitespace

            if byte_data == 0x2c:  # ,
                return json_machine(on_value, _comma)

            if byte_data == 0x5d:  # ]
                return emit(array_data)

            raise Exception("Unexpected byte: 0x%02x in array body" % byte_data)

        return _array


    def object_machine(emit):
        object_data = {}
        key = None

        def _object(byte_data):
            if byte_data == 0x7d:  # }
                return emit(object_data)

            return _key(byte_data)

        def _key(byte_data):
            if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
                return _object  # Ignore whitespace

            if byte_data == 0x22:
                return string_machine(on_key)

            raise Exception("Unexpected byte: 0x%02x" % byte_data)

        def on_key(result):
            nonlocal key
            key = result
            return _colon

        def _colon(byte_data):
            if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
                return _colon  # Ignore whitespace

            if byte_data == 0x3a:  # :
                return json_machine(on_value, _comma)

            raise Exception("Unexpected byte: 0x%02x" % byte_data)

        def on_value(value):
            object_data[key] = value

        def _comma(byte_data):
            if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
                return _comma  # Ignore whitespace

            if byte_data == 0x2c:  # ,
                return _key

            if byte_data == 0x7d:  # }
                return emit(object_data)

            raise Exception("Unexpected byte: 0x%02x" % byte_data)

        return _object

Python 2 compatible version

    # A streaming byte oriented JSON parser.  Feed it a single byte at a time and
    # it will emit complete objects as it comes across them.  Whitespace within and
    # between objects is ignored.  This means it can parse newline delimited JSON.
    import math

    try:
        unichr  # Python 2
    except NameError:
        unichr = chr  # Python 3


    def json_machine(emit, next_func=None):
        def _value(byte_data):
            if not byte_data:
                return

            if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
                return _value  # Ignore whitespace

            if byte_data == 0x22:  # "
                return string_machine(on_value)

            if byte_data == 0x2d or (0x30 <= byte_data < 0x40):  # - or 0-9
                return number_machine(byte_data, on_number)

            if byte_data == 0x7b:  # {
                return object_machine(on_value)

            if byte_data == 0x5b:  # [
                return array_machine(on_value)

            if byte_data == 0x74:  # t
                return constant_machine(TRUE, True, on_value)

            if byte_data == 0x66:  # f
                return constant_machine(FALSE, False, on_value)

            if byte_data == 0x6e:  # n
                return constant_machine(NULL, None, on_value)

            if next_func == _value:
                raise Exception("Unexpected 0x%02x" % byte_data)

            return next_func(byte_data)

        def on_value(value):
            emit(value)
            return next_func

        def on_number(number, byte):
            emit(number)
            return _value(byte)

        next_func = next_func or _value
        return _value


    TRUE = [0x72, 0x75, 0x65]
    FALSE = [0x61, 0x6c, 0x73, 0x65]
    NULL = [0x75, 0x6c, 0x6c]


    def constant_machine(bytes_data, value, emit):
        # A dict stands in for the nonlocal keyword, which Python 2 lacks.
        local_data = {"i": 0, "length": len(bytes_data)}

        def _constant(byte_data):
            if byte_data != bytes_data[local_data["i"]]:
                local_data["i"] += 1
                raise Exception("Unexpected 0x%02x" % byte_data)

            local_data["i"] += 1

            if local_data["i"] < local_data["length"]:
                return _constant
            return emit(value)

        return _constant


    def string_machine(emit):
        local_data = {"string": ""}

        def _string(byte_data):
            if byte_data == 0x22:  # "
                return emit(local_data["string"])

            if byte_data == 0x5c:  # backslash
                return _escaped_string

            if byte_data & 0x80:  # UTF-8 handling
                return utf8_machine(byte_data, on_char_code)

            if byte_data < 0x20:  # ASCII control character
                raise Exception("Unexpected control character: 0x%02x" % byte_data)

            local_data["string"] += chr(byte_data)
            return _string

        def _escaped_string(byte_data):
            if byte_data == 0x22 or byte_data == 0x5c or byte_data == 0x2f:  # " \ /
                local_data["string"] += chr(byte_data)
                return _string

            if byte_data == 0x62:  # b
                local_data["string"] += "\b"
                return _string

            if byte_data == 0x66:  # f
                local_data["string"] += "\f"
                return _string

            if byte_data == 0x6e:  # n
                local_data["string"] += "\n"
                return _string

            if byte_data == 0x72:  # r
                local_data["string"] += "\r"
                return _string

            if byte_data == 0x74:  # t
                local_data["string"] += "\t"
                return _string

            if byte_data == 0x75:  # u
                return hex_machine(on_char_code)

        def on_char_code(char_code):
            # unichr handles code points above 255 on Python 2
            local_data["string"] += unichr(char_code)
            return _string

        return _string


    # Nestable state machine for UTF-8 Decoding.
    def utf8_machine(byte_data, emit):
        local_data = {"left": 0, "num": 0}

        def _utf8(byte_data):
            if (byte_data & 0xc0) != 0x80:
                raise Exception("Invalid byte in UTF-8 character: 0x%02x" % byte_data)

            local_data["left"] -= 1

            local_data["num"] |= (byte_data & 0x3f) << (local_data["left"] * 6)
            if local_data["left"]:
                return _utf8
            return emit(local_data["num"])

        if 0xc0 <= byte_data < 0xe0:  # 2-byte UTF-8 Character
            local_data["left"] = 1
            local_data["num"] = (byte_data & 0x1f) << 6
            return _utf8

        if 0xe0 <= byte_data < 0xf0:  # 3-byte UTF-8 Character
            local_data["left"] = 2
            local_data["num"] = (byte_data & 0xf) << 12
            return _utf8

        if 0xf0 <= byte_data < 0xf8:  # 4-byte UTF-8 Character
            local_data["left"] = 3
            local_data["num"] = (byte_data & 0x07) << 18
            return _utf8

        raise Exception("Invalid byte in UTF-8 string: 0x%02x" % byte_data)


    # Nestable state machine for hex escaped characters
    def hex_machine(emit):
        local_data = {"left": 4, "num": 0}

        def _hex(byte_data):
            i = 0  # Parse the hex byte

            if 0x30 <= byte_data < 0x40:
                i = byte_data - 0x30
            elif 0x61 <= byte_data <= 0x66:
                i = byte_data - 0x57
            elif 0x41 <= byte_data <= 0x46:
                i = byte_data - 0x37
            else:
                raise Exception("Expected hex char in string hex escape")

            local_data["left"] -= 1
            local_data["num"] |= i << (local_data["left"] * 4)

            if local_data["left"]:
                return _hex
            return emit(local_data["num"])

        return _hex


    def number_machine(byte_data, emit):
        # "divisor" is the power of ten for the next fractional digit; it is a
        # float so that Python 2 does not fall back to integer division.
        local_data = {"sign": 1, "number": 0, "decimal": 0, "divisor": 1.0,
                      "esign": 1, "exponent": 0}

        def _mid(byte_data):
            if byte_data == 0x2e:  # .
                return _decimal

            return _later(byte_data)

        def _number(byte_data):
            if 0x30 <= byte_data < 0x40:
                local_data["number"] = local_data["number"] * 10 + (byte_data - 0x30)
                return _number

            return _mid(byte_data)

        def _start(byte_data):
            if byte_data == 0x30:
                return _mid

            if 0x30 < byte_data < 0x40:
                return _number(byte_data)

            raise Exception("Invalid number: 0x%02x" % byte_data)

        def _decimal(byte_data):
            if 0x30 <= byte_data < 0x40:
                local_data["divisor"] *= 10
                local_data["decimal"] += (byte_data - 0x30) / local_data["divisor"]
                return _decimal

            return _later(byte_data)

        def _later(byte_data):
            if byte_data == 0x45 or byte_data == 0x65:  # E e
                return _esign

            return _done(byte_data)

        def _esign(byte_data):
            if byte_data == 0x2b:  # +
                return _exponent

            if byte_data == 0x2d:  # -
                local_data["esign"] = -1
                return _exponent

            return _exponent(byte_data)

        def _exponent(byte_data):
            if 0x30 <= byte_data < 0x40:
                local_data["exponent"] = local_data["exponent"] * 10 + (byte_data - 0x30)
                return _exponent

            return _done(byte_data)

        def _done(byte_data):
            value = local_data["sign"] * (local_data["number"] + local_data["decimal"])
            if local_data["exponent"]:
                value *= math.pow(10, local_data["esign"] * local_data["exponent"])

            return emit(value, byte_data)

        # The sign check must come after all nested states are defined;
        # unlike JavaScript, Python does not hoist function definitions.
        if byte_data == 0x2d:  # -
            local_data["sign"] = -1
            return _start

        return _start(byte_data)


    def array_machine(emit):
        local_data = {"array_data": []}

        def _array(byte_data):
            if byte_data == 0x5d:  # ]
                return emit(local_data["array_data"])

            return json_machine(on_value, _comma)(byte_data)

        def on_value(value):
            local_data["array_data"].append(value)

        def _comma(byte_data):
            if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
                return _comma  # Ignore whitespace

            if byte_data == 0x2c:  # ,
                return json_machine(on_value, _comma)

            if byte_data == 0x5d:  # ]
                return emit(local_data["array_data"])

            raise Exception("Unexpected byte: 0x%02x in array body" % byte_data)

        return _array


    def object_machine(emit):
        local_data = {"object_data": {}, "key": ""}

        def _object(byte_data):
            if byte_data == 0x7d:  # }
                return emit(local_data["object_data"])

            return _key(byte_data)

        def _key(byte_data):
            if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
                return _object  # Ignore whitespace

            if byte_data == 0x22:
                return string_machine(on_key)

            raise Exception("Unexpected byte: 0x%02x" % byte_data)

        def on_key(result):
            local_data["key"] = result
            return _colon

        def _colon(byte_data):
            if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
                return _colon  # Ignore whitespace

            if byte_data == 0x3a:  # :
                return json_machine(on_value, _comma)

            raise Exception("Unexpected byte: 0x%02x" % byte_data)

        def on_value(value):
            local_data["object_data"][local_data["key"]] = value

        def _comma(byte_data):
            if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
                return _comma  # Ignore whitespace

            if byte_data == 0x2c:  # ,
                return _key

            if byte_data == 0x7d:  # }
                return emit(local_data["object_data"])

            raise Exception("Unexpected byte: 0x%02x" % byte_data)

        return _object

Testing it

    if __name__ == "__main__":
        test_json = """[1,2,"3"] {"name":
        "tarun"} 1 2
        3 [{"name":"a",
        "data": [1,
        null,2]}]
    """

        def found_json(data):
            print(data)

        state = json_machine(found_json)

        # Feed the parser one character code at a time.
        for char in test_json:
            state = state(ord(char))

The output is:

    [1, 2, '3']
    {'name': 'tarun'}
    1
    2
    3
    [{'name': 'a', 'data': [1, None, 2]}]

wuliang · Answer

I'd like to offer a solution. The key idea is to "try" to decode: if it fails, feed it more of the stream; otherwise, use the offset information to prepare the next decoding.

However, the current json module can't tolerate whitespace at the head of the string being decoded, so I have to strip it off.

    import sys
    import json

    def iterload(file):
        buffer = ""
        dec = json.JSONDecoder()
        for line in file:
            buffer = buffer.strip(" \n\r\t") + line.strip(" \n\r\t")
            while True:
                try:
                    r = dec.raw_decode(buffer)
                except ValueError:
                    # Incomplete value: wait for the next line.
                    break
                yield r[0]
                buffer = buffer[r[1]:].strip(" \n\r\t")

    for o in iterload(sys.stdin):
        print("Working on a", type(o), o)

I have tested several txt files, and it works fine.

(in1.txt)

    {"foo": ["bar", "baz"]
    }
     1 2 [
      ]  4
    {"foo1": ["bar1", {"foo2":{"A":1, "B":3}, "DDD":4}]
    }
     5   6

(in2.txt)

    {"foo"
    : ["bar",
      "baz"]
      }
    1 2 [
    ] 4 5 6

(in.txt, your initial one)

    {"foo": ["bar", "baz"]} 1 2 [] 4 5 6

(output for Benedict's test case)

    python test.py < in.txt
    ('Working on a', <type 'list'>, [u'hello'])
    ('Working on a', <type 'dict'>, {u'goodbye': 1})
    ('Working on a', <type 'int'>, 1)
    ('Working on a', <type 'int'>, 2)
    ('Working on a', <type 'dict'>, {})
    ('Working on a', <type 'int'>, 2)
    ('Working on a', <type 'int'>, 9)
    ('Working on a', <type 'int'>, 78)
    ('Working on a', <type 'int'>, 4)
    ('Working on a', <type 'int'>, 5)
    ('Working on a', <type 'dict'>, {u'animals': [u'dog', u'lots of mice', u'cat']})

sigpwned · Answer

I used @wuliang's elegant solution. The simple approach (read a byte, try to decode, read a byte, try to decode, ...) worked, but unfortunately it was very slow.

In my case, I was trying to read "pretty-printed" JSON objects of the same object type from a file. This allowed me to optimize the approach: I could read the file line by line, only decoding when I found a line that contained exactly "}":

    import json

    def iterload(stream):
        buf = ""
        dec = json.JSONDecoder()
        for line in stream:
            line = line.rstrip()
            buf = buf + line
            if line == "}":
                # raw_decode returns (object, end_index); keep only the object
                yield dec.raw_decode(buf)[0]
                buf = ""

If you happen to be working with one-per-line compact JSON that escapes newlines in string literals, then you can safely simplify this approach even more:

    def iterload(stream):
        dec = json.JSONDecoder()
        for line in stream:
            yield dec.raw_decode(line)[0]

Obviously, these simple approaches only work for very specific kinds of JSON. However, if these assumptions hold, these solutions work correctly and quickly.
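To show when the "line equal to `}`" assumption holds, here is a small self-contained check. It assumes the input was produced by `json.dumps(obj, indent=2)`, which puts the closing brace of a non-empty top-level object alone at column 0 while nested closers stay indented; `iterload_pretty` and the sample records are made up for the example.

```python
import io
import json

def iterload_pretty(stream):
    """Decode pretty-printed top-level objects, one per closing '}' line."""
    buf = ""
    dec = json.JSONDecoder()
    for line in stream:
        line = line.rstrip()  # keeps leading indentation, so "  }" != "}"
        buf += line
        if line == "}":
            yield dec.raw_decode(buf)[0]
            buf = ""

records = [{"id": 1, "tags": ["a", "b"]}, {"id": 2, "nested": {"k": "v"}}]
text = "\n".join(json.dumps(r, indent=2) for r in records) + "\n"
decoded = list(iterload_pretty(io.StringIO(text)))
```

An empty top-level object would be written as `{}` on a single line and never match, which is one example of the "very specific kinds of JSON" caveat.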

user3542882 · Answer

Here's mine:

    import simplejson as json
    from simplejson import JSONDecodeError


    class StreamJsonListLoader():
        """
        When you have a big JSON file containing a list, such as

        [{
            ...
        },
        {
            ...
        },
        {
            ...
        },
        ...
        ]

        And it's too big to be practically loaded into memory and parsed by json.load,
        This class comes to the rescue. It lets you lazy-load the large json list.
        """

        def __init__(self, filename_or_stream):
            if isinstance(filename_or_stream, str):
                self.stream = open(filename_or_stream)
            else:
                self.stream = filename_or_stream

            if not self.stream.read(1) == '[':
                raise NotImplementedError('Only JSON-streams of lists (that start with a [) are supported.')

        def __iter__(self):
            return self

        def next(self):
            read_buffer = self.stream.read(1)
            while True:
                try:
                    json_obj = json.loads(read_buffer)

                    if not self.stream.read(1) in [',', ']']:
                        raise Exception('JSON seems to be malformed: object is not followed by comma (,) or end of list (]).')
                    return json_obj
                except JSONDecodeError:
                    # Not a complete object yet: read up to the next '}' and retry.
                    next_char = self.stream.read(1)
                    read_buffer += next_char
                    while next_char != '}':
                        next_char = self.stream.read(1)
                        if next_char == '':
                            raise StopIteration
                        read_buffer += next_char

        __next__ = next  # Python 3 iterator protocol

hetepeperfan · Answer

If you use a json.JSONDecoder instance, you can use its raw_decode member function. It returns a tuple of the Python representation of the JSON value and an index to where the parsing stopped. This makes it easy to slice off (or seek past, in a stream object) the remaining JSON values. I'm not so happy about the extra while loop that skips over the whitespace between the different JSON values in the input, but it gets the job done in my opinion.

    import json

    def yield_multiple_value(f):
        '''
        parses multiple JSON values from a file.
        '''
        vals_str = f.read()
        decoder = json.JSONDecoder()
        try:
            nread = 0
            while nread < len(vals_str):
                val, n = decoder.raw_decode(vals_str[nread:])
                nread += n
                # Skip over whitespace because of bug, below.
                while nread < len(vals_str) and vals_str[nread].isspace():
                    nread += 1
                yield val
        except json.JSONDecodeError as e:
            pass
        return

The next version is much shorter and eats the part of the string that has already been parsed. For some reason, a second call to json.JSONDecoder.raw_decode() seems to fail when the first character in the string is whitespace; that is also the reason why I skip over the whitespace in the while loop above.

    import json

    def yield_multiple_value(f):
        '''
        parses multiple JSON values from a file.
        '''
        vals_str = f.read()
        decoder = json.JSONDecoder()
        while vals_str:
            val, n = decoder.raw_decode(vals_str)
            # remove the read characters from the start.
            vals_str = vals_str[n:]
            # remove leading white space because a second call to decoder.raw_decode()
            # fails when the string starts with whitespace, and
            # I don't understand why...
            vals_str = vals_str.lstrip()
            yield val
        return

The documentation for the json.JSONDecoder class (https://docs.python.org/3/library/json.html#encoders-and-decoders) says the following about the raw_decode method:

This can be used to decode a JSON document from a string that may have extraneous data at the end.

And this extraneous data can easily be another JSON value. In other words, the method may well have been written with this purpose in mind.

With the input.txt file, using the upper function, I obtain the example output presented in the original question.
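The whitespace behaviour can be reproduced in isolation. As far as I can tell from CPython's `json/decoder.py`, `decode()` skips leading whitespace before handing the string to `raw_decode()`, whereas `raw_decode()` expects the string to begin with a JSON document; that would explain the failure described above. A minimal sketch (Python 3.5+, sample strings made up):

```python
import json

decoder = json.JSONDecoder()

# raw_decode returns the value and the index where parsing stopped.
value, end = decoder.raw_decode('{"a": 1} [2, 3]')
remainder = '{"a": 1} [2, 3]'[end:]  # starts with a space

# raw_decode does not skip the leading space...
try:
    decoder.raw_decode(remainder)
    leading_ws_fails = False
except json.JSONDecodeError:
    leading_ws_fails = True

# ...but decode() does, and so does stripping it by hand.
via_decode = decoder.decode(remainder)
via_strip, _ = decoder.raw_decode(remainder.lstrip())
```

Passing a start index as the second argument to `raw_decode` is an alternative to slicing the string, which avoids copying the remainder on every iteration.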