I am using TensorFlow version:
0.12.1
The CUDA toolkit version is 8:
lrwxrwxrwx 1 root root 19 May 28 17:27 cuda -> /usr/local/cuda-8.0
As documented here, I downloaded and installed cuDNN. But when running the following line of my Python script, I get the error messages mentioned in the title:
model.fit_generator(train_generator,
                    steps_per_epoch=len(train_samples),
                    validation_data=validation_generator,
                    validation_steps=len(validation_samples),
                    epochs=9)
The detailed error message is as follows:
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
Epoch 1/9
Exception in thread Thread-1:
Traceback (most recent call last):
File " lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File " lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File " lib/python3.5/site-packages/keras/engine/training.py", line 612, in data_generator_task
generator_output = next(self._generator)
StopIteration
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 3.94GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
Traceback (most recent call last): File "model_new.py", line 82, in <module>
model.fit_generator(train_generator, steps_per_Epoch= len(train_samples),validation_data=validation_generator, validation_steps=len(validation_samples),epochs=9) File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
return func(*args, **kwargs) File " lib/python3.5/site-packages/keras/models.py", line 1110, in fit_generator
initial_Epoch=initial_Epoch) File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
return func(*args, **kwargs) File " lib/python3.5/site-packages/keras/engine/training.py", line 1890, in fit_generator
class_weight=class_weight) File " lib/python3.5/site-packages/keras/engine/training.py", line 1633, in train_on_batch
outputs = self.train_function(ins) File " lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2229, in __call__
feed_dict=feed_dict) File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr) File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 937, in _run
np_val = np.asarray(subfeed_val, dtype=subfeed_dtype) File " lib/python3.5/site-packages/numpy/core/numeric.py", line 531, in asarray
return array(a, dtype, copy=False, order=order) MemoryError
Any suggestion for resolving this error would be appreciated.
EDIT: The problem is fatal.
uname -a
Linux ip-172-31-76-109 4.4.0-78-generic #99-Ubuntu SMP
Thu Apr 27 15:29:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
sudo lshw -short
[sudo] password for carnd:
H/W path Device Class Description
==========================================
system HVM domU
/0 bus Motherboard
/0/0 memory 96KiB BIOS
/0/401 processor Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
/0/402 processor CPU
/0/403 processor CPU
/0/404 processor CPU
/0/405 processor CPU
/0/406 processor CPU
/0/407 processor CPU
/0/408 processor CPU
/0/1000 memory 15GiB System Memory
/0/1000/0 memory 15GiB DIMM RAM
/0/100 bridge 440FX - 82441FX PMC [Natoma]
/0/100/1 bridge 82371SB PIIX3 ISA [Natoma/Triton II]
/0/100/1.1 storage 82371SB PIIX3 IDE [Natoma/Triton II]
/0/100/1.3 bridge 82371AB/EB/MB PIIX4 ACPI
/0/100/2 display Gd 5446
/0/100/3 display GK104GL [GRID K520]
/0/100/1f generic Xen Platform Device
/1 eth0 network Ethernet interface
EDIT 2:
This is an EC2 instance in the Amazon cloud, and these are all the files that contain the value -1:
:/sys$ find . -name numa_node -exec cat '{}' \;
find: ‘./fs/fuse/connections/39’: Permission denied
-1
-1
-1
-1
-1
-1
-1
find: ‘./kernel/debug’: Permission denied
EDIT 3: After updating the numa_node files, the NUMA-related error disappeared. But all the other errors listed above remain, and once again I got a fatal error:
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
Epoch 1/9
Exception in thread Thread-1:
Traceback (most recent call last):
File " lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File " lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File " lib/python3.5/site-packages/keras/engine/training.py", line 612, in data_generator_task
generator_output = next(self._generator)
StopIteration
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 3.94GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
Traceback (most recent call last):
File "model_new.py", line 85, in <module>
model.fit_generator(train_generator, steps_per_epoch=len(train_samples), validation_data=validation_generator, validation_steps=len(validation_samples), epochs=9)
File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
return func(*args, **kwargs)
File " lib/python3.5/site-packages/keras/models.py", line 1110, in fit_generator
initial_epoch=initial_epoch)
File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
return func(*args, **kwargs)
File " lib/python3.5/site-packages/keras/engine/training.py", line 1890, in fit_generator
class_weight=class_weight)
File " lib/python3.5/site-packages/keras/engine/training.py", line 1633, in train_on_batch
outputs = self.train_function(ins)
File " lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2229, in __call__
feed_dict=feed_dict)
File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 937, in _run
np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)
File " lib/python3.5/site-packages/numpy/core/numeric.py", line 531, in asarray
return array(a, dtype, copy=False, order=order)
MemoryError
There is the code that prints the message "successful NUMA node read from SysFS had negative value (-1)", and it is not a fatal error, it is just a warning. The real error is the MemoryError in your File "model_new.py", line 85, in <module>. We need more of the source code to check this error. Try to reduce the size of your model or run it on a server with more RAM.
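As a related illustration of keeping the per-step memory small (a sketch, not from the original post; load_sample, the array shapes and batch_size below are hypothetical placeholders for whatever your real generator does), the generator can yield small batches and steps_per_epoch can count batches rather than samples, so the numpy arrays fed to session.run stay modest:

import numpy as np

def load_sample(sample):
    # Placeholder: in the real script this would read one image and its label from disk.
    return np.zeros((160, 320, 3), dtype=np.float32), 0.0

def generator(samples, batch_size=32):
    # Yield small batches forever, as Keras fit_generator expects.
    num_samples = len(samples)
    while True:
        np.random.shuffle(samples)
        for offset in range(0, num_samples, batch_size):
            batch = samples[offset:offset + batch_size]
            data = [load_sample(s) for s in batch]
            images = np.array([d[0] for d in data])
            labels = np.array([d[1] for d in data])
            yield images, labels

# steps_per_epoch counts batches, not samples, which also keeps each feed dict small:
# model.fit_generator(generator(train_samples, batch_size=32),
#                     steps_per_epoch=len(train_samples) // 32,
#                     validation_data=generator(validation_samples, batch_size=32),
#                     validation_steps=len(validation_samples) // 32,
#                     epochs=9)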
About the NUMA node warning:
// Attempts to read the NUMA node corresponding to the GPU device's PCI bus out
// of SysFS. Returns -1 if it cannot...
static int TryToReadNumaNode(const string &pci_bus_id, int device_ordinal)
{...
  string filename =
      port::Printf("/sys/bus/pci/devices/%s/numa_node", pci_bus_id.c_str());
  FILE *file = fopen(filename.c_str(), "r");
  if (file == nullptr) {
    LOG(ERROR) << "could not open file to read NUMA node: " << filename
               << "\nYour kernel may have been built without NUMA support.";
    return kUnknownNumaNode;
  } ...
  if (port::safe_strto32(content, &value)) {
    if (value < 0) {  // See http://b/18228951 for details on this path.
      LOG(INFO) << "successful NUMA node read from SysFS had negative value ("
                << value << "), but there must be at least one NUMA node"
                   ", so returning NUMA node zero";
      fclose(file);
      return 0;
    }
TensorFlow was able to open the file /sys/bus/pci/devices/%s/numa_node, where %s is the PCI id of the GPU card (string pci_bus_id = CUDADriver::GetPCIBusID(device_)). Your PC is not multi-socket; there is only one CPU socket with an 8-core Xeon E5-2670 installed, so this id should be '0' (a single NUMA node is numbered 0 in Linux), but the error message says there was a -1 value in this file!
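As a quick check (a sketch, not part of the original answer; it only assumes sysfs is mounted at /sys), the same value TensorFlow reads can be listed per PCI device from Python, which also tells you which device id to target in the workaround further below:

import glob

# Print every PCI device together with the NUMA node sysfs reports for it.
for path in sorted(glob.glob("/sys/bus/pci/devices/*/numa_node")):
    device = path.split("/")[-2]          # e.g. 0000:00:03.0 for the GRID K520
    with open(path) as f:
        print(device, f.read().strip())   # -1 here reproduces the warning above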
So we know that sysfs is mounted at /sys, that there is a numa_node special file, and that CONFIG_NUMA is enabled in your Linux kernel configuration (zgrep NUMA /boot/config* /proc/config*). In fact it is enabled: CONFIG_NUMA=y, in the deb package of your x86_64 4.4.0-78-generic kernel.
The special file numa_node is documented at https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-bus-pci (is the ACPI of your PC incorrect?):
What: /sys/bus/pci/devices/.../numa_node
Date: Oct 2014
Contact: Prarit Bhargava <[email protected]>
Description:
This file contains the NUMA node to which the PCI device is
attached, or -1 if the node is unknown. The initial value
comes from an ACPI _PXM method or a similar firmware
source. If that is missing or incorrect, this file can be
written to override the node. In that case, please report
a firmware bug to the system vendor. Writing to this file
taints the kernel with TAINT_FIRMWARE_WORKAROUND, which
reduces the supportability of your system.
There is a quick workaround (a kludge) for this error: find the numa_node of your GPU and, as root, run this command after every boot, where NNNNN is the PCI id of your card (look it up in the lspci output and in the /sys/bus/pci/devices/ directory):
echo 0 | sudo tee -a /sys/bus/pci/devices/NNNNN/numa_node
Or just echo into every one of these files; it should be rather safe:
for a in /sys/bus/pci/devices/*; do echo 0 | sudo tee -a $a/numa_node; done
Your lshw output also shows that this is not a PC but a Xen virtual guest. There is an issue between the Xen platform emulation (ACPI) and the Linux PCI bus NUMA support code.