
MemoryError in TensorFlow; and "successful NUMA node read from SysFS had negative value (-1)" with xen

I am using TensorFlow version:

0.12.1

The CUDA toolkit version is 8:

lrwxrwxrwx  1 root root   19 May 28 17:27 cuda -> /usr/local/cuda-8.0

As documented here, I downloaded and installed cuDNN. But when I run the following lines of my Python script, I get the error messages mentioned in the title:

  model.fit_generator(train_generator,
                      steps_per_epoch=len(train_samples),
                      validation_data=validation_generator,
                      validation_steps=len(validation_samples),
                      epochs=9)

The detailed error message is as follows:

Using TensorFlow backend. 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally 
Epoch 1/9
Exception in thread Thread-1:
Traceback (most recent call last):
  File " lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File " lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
StopIteration

I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 3.94GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
Traceback (most recent call last):
  File "model_new.py", line 82, in <module>
    model.fit_generator(train_generator, steps_per_epoch= len(train_samples),validation_data=validation_generator, validation_steps=len(validation_samples),epochs=9)
  File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File " lib/python3.5/site-packages/keras/models.py", line 1110, in fit_generator
    initial_epoch=initial_epoch)
  File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 1890, in fit_generator
    class_weight=class_weight)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 1633, in train_on_batch
    outputs = self.train_function(ins)
  File " lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2229, in __call__
    feed_dict=feed_dict)
  File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 937, in _run
    np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)
  File " lib/python3.5/site-packages/numpy/core/numeric.py", line 531, in asarray
    return array(a, dtype, copy=False, order=order)
MemoryError

Any suggestion to resolve this error would be appreciated.

EDIT: The problem is fatal.

uname -a
Linux ip-172-31-76-109 4.4.0-78-generic #99-Ubuntu SMP
Thu Apr 27 15:29:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

sudo lshw -short
[sudo] password for carnd:
H/W path    Device  Class      Description
==========================================
                    system     HVM domU
/0                  bus        Motherboard
/0/0                memory     96KiB BIOS
/0/401              processor  Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
/0/402              processor  CPU
/0/403              processor  CPU
/0/404              processor  CPU
/0/405              processor  CPU
/0/406              processor  CPU
/0/407              processor  CPU
/0/408              processor  CPU
/0/1000             memory     15GiB System Memory
/0/1000/0           memory     15GiB DIMM RAM
/0/100              bridge     440FX - 82441FX PMC [Natoma]
/0/100/1            bridge     82371SB PIIX3 ISA [Natoma/Triton II]
/0/100/1.1          storage    82371SB PIIX3 IDE [Natoma/Triton II]
/0/100/1.3          bridge     82371AB/EB/MB PIIX4 ACPI
/0/100/2            display    Gd 5446
/0/100/3            display    GK104GL [GRID K520]
/0/100/1f           generic    Xen Platform Device
/1          eth0    network    Ethernet interface

EDIT 2:

This is an EC2 instance in the Amazon cloud, and all of the numa_node files contain the value -1.

:/sys$ find . -name numa_node -exec cat '{}' \;
find: ‘./fs/fuse/connections/39’: Permission denied
-1
-1
-1
-1
-1
-1
-1
find: ‘./kernel/debug’: Permission denied

EDIT 3: After updating the numa_node files, the NUMA-related error disappeared. But all the other errors listed above remain, and I again got a fatal error.

Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
Epoch 1/9
Exception in thread Thread-1:
Traceback (most recent call last):
  File " lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File " lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
StopIteration

I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 3.94GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
Traceback (most recent call last):
  File "model_new.py", line 85, in <module>
    model.fit_generator(train_generator, steps_per_epoch= len(train_samples),validation_data=validation_generator, validation_steps=len(validation_samples),epochs=9)
  File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File " lib/python3.5/site-packages/keras/models.py", line 1110, in fit_generator
    initial_epoch=initial_epoch)
  File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 1890, in fit_generator
    class_weight=class_weight)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 1633, in train_on_batch
    outputs = self.train_function(ins)
  File " lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2229, in __call__
    feed_dict=feed_dict)
  File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 937, in _run
    np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)
  File " lib/python3.5/site-packages/numpy/core/numeric.py", line 531, in asarray
    return array(a, dtype, copy=False, order=order)
MemoryError

There is code that prints the message "successful NUMA node read from SysFS had negative value (-1)", and it is not a fatal error, it is just a warning. The real error is the MemoryError in your File "model_new.py", line 85. We need more of your source to check this error. Try to make your model smaller, or run it on a server with more RAM.
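
One way to shrink the arrays that get fed to the session (the MemoryError above is raised in np.asarray while the feed batch is being built) is to have the generator load and yield small batches lazily. A minimal sketch, assuming the samples are (image_path, label) pairs; batch_generator and batch_size are illustrative names, not your actual code:

import numpy as np
from keras.preprocessing.image import load_img, img_to_array

def batch_generator(samples, batch_size=32):
    # Yield small (X, y) batches; small batches keep the arrays that end up
    # in feed_dict small, which is where the MemoryError above is raised.
    while True:  # fit_generator expects an endless generator (avoids StopIteration)
        for offset in range(0, len(samples), batch_size):
            batch = samples[offset:offset + batch_size]
            images = [img_to_array(load_img(path)) for path, _ in batch]
            labels = [label for _, label in batch]
            yield np.array(images), np.array(labels)

# steps_per_epoch then counts batches per epoch, not samples:
# model.fit_generator(batch_generator(train_samples, batch_size=32),
#                     steps_per_epoch=len(train_samples) // 32, ...)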


About the NUMA node warning:

https://github.com/tensorflow/tensorflow/blob/e4296aefff97e6edd3d7cee9a09b9dd77da4c034/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc#L855

// Attempts to read the NUMA node corresponding to the GPU device's PCI bus out
// of SysFS. Returns -1 if it cannot...
static int TryToReadNumaNode(const string &pci_bus_id, int device_ordinal) 
{...
  string filename =
      port::Printf("/sys/bus/pci/devices/%s/numa_node", pci_bus_id.c_str());
  FILE *file = fopen(filename.c_str(), "r");
  if (file == nullptr) {
    LOG(ERROR) << "could not open file to read NUMA node: " << filename
               << "\nYour kernel may have been built without NUMA support.";
    return kUnknownNumaNode;
  } ...
  if (port::safe_strto32(content, &value)) {
    if (value < 0) {  // See http://b/18228951 for details on this path.
      LOG(INFO) << "successful NUMA node read from SysFS had negative value ("
                << value << "), but there must be at least one NUMA node"
                            ", so returning NUMA node zero";
      fclose(file);
      return 0;
    }

TensorFlow was able to open the file /sys/bus/pci/devices/%s/numa_node, where %s is the PCI ID of the GPU card (string pci_bus_id = CUDADriver::GetPCIBusID(device_)). Your PC is not multi-socket: there is only a single CPU socket with an 8-core Xeon E5-2670 installed, so this ID should be '0' (the single NUMA node is numbered 0 in Linux), but the error message says there was a -1 value in this file!
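
You can check the same sysfs file from Python before starting TensorFlow; a minimal sketch, assuming the GPU PCI bus ID 0000:00:03.0 shown in the log above:

# Read the same sysfs file that TryToReadNumaNode() opens; on a single-socket
# machine it should contain 0, but here the firmware leaves it at -1.
pci_bus_id = "0000:00:03.0"  # GPU PCI bus id, as printed in the TensorFlow log
path = "/sys/bus/pci/devices/{}/numa_node".format(pci_bus_id)
with open(path) as f:
    print(path, "->", f.read().strip())  # "-1" triggers the warning; TF then falls back to node 0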

So we know that sysfs is mounted at /sys, that the numa_node special file exists there, and that CONFIG_NUMA is enabled in your Linux kernel config (zgrep NUMA /boot/config* /proc/config*). In fact it is enabled: CONFIG_NUMA=y in the deb of your x86_64 4.4.0-78-generic kernel.

The numa_node special file is documented in https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-bus-pci (is your PC's ACPI incorrect?):

What:       /sys/bus/pci/devices/.../numa_node
Date:       Oct 2014
Contact:    Prarit Bhargava <[email protected]>
Description:
        This file contains the NUMA node to which the PCI device is
        attached, or -1 if the node is unknown.  The initial value
        comes from an ACPI _PXM method or a similar firmware
        source.  If that is missing or incorrect, this file can be
        written to override the node.  In that case, please report
        a firmware bug to the system vendor.  Writing to this file
        taints the kernel with TAINT_FIRMWARE_WORKAROUND, which
        reduces the supportability of your system.

There is a quick workaround (a kludge) for this error: find the numa_node of your GPU and, as root, run this command after every boot, where NNNNN is the PCI ID of your card (look it up in the lspci output and in the /sys/bus/pci/devices/ directory):

echo 0 | sudo tee -a /sys/bus/pci/devices/NNNNN/numa_node

Or just echo into every one of these files; it should be reasonably safe:

for a in /sys/bus/pci/devices/*; do echo 0 | sudo tee -a $a/numa_node; done
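
The same kludge can also be scripted, for example from a root-owned boot script; a minimal Python sketch (how you hook it into the boot sequence is up to you):

#!/usr/bin/env python3
# Write 0 into every PCI device's numa_node file that still reports -1.
# Must run as root; as the ABI documentation above notes, writing the file
# taints the kernel with TAINT_FIRMWARE_WORKAROUND.
import glob

for path in glob.glob("/sys/bus/pci/devices/*/numa_node"):
    with open(path) as f:
        if f.read().strip() != "-1":
            continue  # this device already reports a valid NUMA node
    with open(path, "w") as f:
        f.write("0")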

Your lshw output also shows that this is not a physical PC, but a Xen virtual guest. There is something wrong between the Xen platform (ACPI) emulation and the Linux PCI bus NUMA support code.
