
MemoryError in TensorFlow; and "successful NUMA node read from SysFS had negative value (-1)" with xen

I am using TensorFlow version:

0.12.1

The CUDA toolkit version is 8:

lrwxrwxrwx  1 root root   19 May 28 17:27 cuda -> /usr/local/cuda-8.0

As documented here, I downloaded and installed cuDNN. But when I run the following lines of my Python script, I get the error messages mentioned in the title:

  model.fit_generator(train_generator,
                      steps_per_epoch=len(train_samples),
                      validation_data=validation_generator,
                      validation_steps=len(validation_samples),
                      epochs=9)

The detailed error message is as follows:

Using TensorFlow backend. 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally 
Epoch 1/9
Exception in thread Thread-1:
Traceback (most recent call last):
  File " lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File " lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
StopIteration

I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 3.94GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
Traceback (most recent call last):
  File "model_new.py", line 82, in <module>
    model.fit_generator(train_generator, steps_per_epoch= len(train_samples),validation_data=validation_generator, validation_steps=len(validation_samples),epochs=9)
  File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File " lib/python3.5/site-packages/keras/models.py", line 1110, in fit_generator
    initial_epoch=initial_epoch)
  File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 1890, in fit_generator
    class_weight=class_weight)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 1633, in train_on_batch
    outputs = self.train_function(ins)
  File " lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2229, in __call__
    feed_dict=feed_dict)
  File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 937, in _run
    np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)
  File " lib/python3.5/site-packages/numpy/core/numeric.py", line 531, in asarray
    return array(a, dtype, copy=False, order=order)
MemoryError

Any suggestion to resolve this error would be appreciated.

EDIT: The problem is fatal.

uname -a
Linux ip-172-31-76-109 4.4.0-78-generic #99-Ubuntu SMP
Thu Apr 27 15:29:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

sudo lshw -short
[sudo] password for carnd:
H/W path    Device  Class      Description
==========================================
                    system     HVM domU
/0                  bus        Motherboard
/0/0                memory     96KiB BIOS
/0/401              processor  Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
/0/402              processor  CPU
/0/403              processor  CPU
/0/404              processor  CPU
/0/405              processor  CPU
/0/406              processor  CPU
/0/407              processor  CPU
/0/408              processor  CPU
/0/1000             memory     15GiB System Memory
/0/1000/0           memory     15GiB DIMM RAM
/0/100              bridge     440FX - 82441FX PMC [Natoma]
/0/100/1            bridge     82371SB PIIX3 ISA [Natoma/Triton II]
/0/100/1.1          storage    82371SB PIIX3 IDE [Natoma/Triton II]
/0/100/1.3          bridge     82371AB/EB/MB PIIX4 ACPI
/0/100/2            display    Gd 5446
/0/100/3            display    GK104GL [GRID K520]
/0/100/1f           generic    Xen Platform Device
/1          eth0    network    Ethernet interface

EDIT 2:

This is an EC2 instance in the Amazon cloud, and all of the numa_node files contain the value -1.

:/sys$ find . -name numa_node -exec cat '{}' \;
find: ‘./fs/fuse/connections/39’: Permission denied
-1
-1
-1
-1
-1
-1
-1
find: ‘./kernel/debug’: Permission denied

EDIT 3: After updating the numa_node files, the NUMA-related error disappeared. But all the other errors listed above remain, and I again got a fatal error.

Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
Epoch 1/9
Exception in thread Thread-1:
Traceback (most recent call last):
  File " lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File " lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
StopIteration

I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 3.94GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
Traceback (most recent call last):
  File "model_new.py", line 85, in <module>
    model.fit_generator(train_generator, steps_per_epoch= len(train_samples),validation_data=validation_generator, validation_steps=len(validation_samples),epochs=9)
  File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File " lib/python3.5/site-packages/keras/models.py", line 1110, in fit_generator
    initial_epoch=initial_epoch)
  File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 1890, in fit_generator
    class_weight=class_weight)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 1633, in train_on_batch
    outputs = self.train_function(ins)
  File " lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2229, in __call__
    feed_dict=feed_dict)
  File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 937, in _run
    np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)
  File " lib/python3.5/site-packages/numpy/core/numeric.py", line 531, in asarray
    return array(a, dtype, copy=False, order=order)
MemoryError

There is code that prints the message "successful NUMA node read from SysFS had negative value (-1)", and it is not a fatal error, it is just a warning. The real error is the MemoryError in your File "model_new.py", line 85. We need more of your source to check this error. Try to make your model smaller, or run it on a server with more RAM.
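
One way to shrink the arrays that get fed to the session (the MemoryError above is raised in np.asarray while the feed batch is being built) is to have the generator load and yield small batches lazily. A minimal sketch, assuming the samples are (image_path, label) pairs; batch_generator and batch_size are illustrative names, not your actual code:

import numpy as np
from keras.preprocessing.image import load_img, img_to_array

def batch_generator(samples, batch_size=32):
    # Yield small (X, y) batches; small batches keep the arrays that end up
    # in feed_dict small, which is where the MemoryError above is raised.
    while True:  # fit_generator expects an endless generator (avoids StopIteration)
        for offset in range(0, len(samples), batch_size):
            batch = samples[offset:offset + batch_size]
            images = [img_to_array(load_img(path)) for path, _ in batch]
            labels = [label for _, label in batch]
            yield np.array(images), np.array(labels)

# steps_per_epoch then counts batches per epoch, not samples:
# model.fit_generator(batch_generator(train_samples, batch_size=32),
#                     steps_per_epoch=len(train_samples) // 32, ...)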


About the NUMA node warning:

https://github.com/tensorflow/tensorflow/blob/e4296aefff97e6edd3d7cee9a09b9dd77da4c034/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc#L855

// Attempts to read the NUMA node corresponding to the GPU device's PCI bus out
// of SysFS. Returns -1 if it cannot...
static int TryToReadNumaNode(const string &pci_bus_id, int device_ordinal) 
{...
  string filename =
      port::Printf("/sys/bus/pci/devices/%s/numa_node", pci_bus_id.c_str());
  FILE *file = fopen(filename.c_str(), "r");
  if (file == nullptr) {
    LOG(ERROR) << "could not open file to read NUMA node: " << filename
               << "\nYour kernel may have been built without NUMA support.";
    return kUnknownNumaNode;
  } ...
  if (port::safe_strto32(content, &value)) {
    if (value < 0) {  // See http://b/18228951 for details on this path.
      LOG(INFO) << "successful NUMA node read from SysFS had negative value ("
                << value << "), but there must be at least one NUMA node"
                            ", so returning NUMA node zero";
      fclose(file);
      return 0;
    }

TensorFlow was able to open the file /sys/bus/pci/devices/%s/numa_node, where %s is the PCI ID of the GPU card (string pci_bus_id = CUDADriver::GetPCIBusID(device_)). Your PC is not multi-socket: there is only a single CPU socket with an 8-core Xeon E5-2670 installed, so this ID should be '0' (the single NUMA node is numbered 0 in Linux), but the error message says there was a -1 value in this file!
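
You can check the same sysfs file from Python before starting TensorFlow; a minimal sketch, assuming the GPU PCI bus ID 0000:00:03.0 shown in the log above:

# Read the same sysfs file that TryToReadNumaNode() opens; on a single-socket
# machine it should contain 0, but here the firmware leaves it at -1.
pci_bus_id = "0000:00:03.0"  # GPU PCI bus id, as printed in the TensorFlow log
path = "/sys/bus/pci/devices/{}/numa_node".format(pci_bus_id)
with open(path) as f:
    print(path, "->", f.read().strip())  # "-1" triggers the warning; TF then falls back to node 0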

So we know that sysfs is mounted at /sys, that the numa_node special file exists there, and that CONFIG_NUMA is enabled in your Linux kernel config (zgrep NUMA /boot/config* /proc/config*). In fact it is enabled: CONFIG_NUMA=y in the deb of your x86_64 4.4.0-78-generic kernel.

The numa_node special file is documented in https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-bus-pci (is your PC's ACPI incorrect?):

What:       /sys/bus/pci/devices/.../numa_node
Date:       Oct 2014
Contact:    Prarit Bhargava <[email protected]>
Description:
        This file contains the NUMA node to which the PCI device is
        attached, or -1 if the node is unknown.  The initial value
        comes from an ACPI _PXM method or a similar firmware
        source.  If that is missing or incorrect, this file can be
        written to override the node.  In that case, please report
        a firmware bug to the system vendor.  Writing to this file
        taints the kernel with TAINT_FIRMWARE_WORKAROUND, which
        reduces the supportability of your system.

There is a quick workaround (a kludge) for this error: find the numa_node of your GPU and, as root, run this command after every boot, where NNNNN is the PCI ID of your card (look it up in the lspci output and in the /sys/bus/pci/devices/ directory):

echo 0 | sudo tee -a /sys/bus/pci/devices/NNNNN/numa_node

Or just echo into every one of these files; it should be reasonably safe:

for a in /sys/bus/pci/devices/*; do echo 0 | sudo tee -a $a/numa_node; done
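
The same kludge can also be scripted, for example from a root-owned boot script; a minimal Python sketch (how you hook it into the boot sequence is up to you):

#!/usr/bin/env python3
# Write 0 into every PCI device's numa_node file that still reports -1.
# Must run as root; as the ABI documentation above notes, writing the file
# taints the kernel with TAINT_FIRMWARE_WORKAROUND.
import glob

for path in glob.glob("/sys/bus/pci/devices/*/numa_node"):
    with open(path) as f:
        if f.read().strip() != "-1":
            continue  # this device already reports a valid NUMA node
    with open(path, "w") as f:
        f.write("0")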

Your lshw output also shows that this is not a physical PC, but a Xen virtual guest. There is something wrong between the Xen platform (ACPI) emulation and the Linux PCI bus NUMA support code.
