GPU server

User submission, posted 11 months ago (09-06)

Here is how the problem came about:

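A senior labmate was running deep learning code on our GPU server and hit an error saying the pandas module was missing, so I gave her sudo access and let her run pip install herself. After that I stopped paying attention.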

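Later, by her account, she uninstalled pandas and reinstalled it, and after that Keras stopped working. I don't know the exact cause. The first thing I did after logging in to the server was try to install Keras, only to find that the keras module was already there.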

Then I opened a Python shell to test whether Keras and TensorFlow were usable, but it just kept raising an error.

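After some searching, I found the cause: the CUDA version on the server did not match the TensorFlow version. The lab GPU server has CUDA 10.0 and Python 3.8. I don't remember which TensorFlow version was installed, because I forgot to run pip list before uninstalling it. One reference said CUDA 10.0 should be paired with tensorflow_gpu-1.14.0, but that assumed Python 3.7, so it was of little use here. My colleague and I then recalled that when we last rebuilt the server, the versions had all been matched automatically and we never thought about pinning anything, so we simply ran pip install.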
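
Since the old TensorFlow version was lost by uninstalling before checking, here is a minimal sketch of recording installed versions first. It assumes Python 3.8 or later (for `importlib.metadata`); the helper name `installed_versions` is hypothetical and not part of the original workflow.

```python
from importlib import metadata  # in the standard library since Python 3.8


def installed_versions(package_names):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Example call (the output depends on the machine):
# print(installed_versions(["tensorflow", "keras", "pandas", "protobuf"]))
```

Saving this output somewhere before running pip uninstall makes it possible to roll back to the exact versions that used to work.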

# Check the CUDA version
-bash-4.2# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
# Check the Python version
-bash-4.2# python -V
Python 3.8.3
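
When comparing versions across several machines, the release number can be pulled out of the `nvcc -V` text programmatically. The sketch below is only an illustration: the function name `parse_cuda_release` is made up, and it assumes the output contains a "release X.Y" fragment like the one above.

```python
import re


def parse_cuda_release(nvcc_output):
    """Return the CUDA release (e.g. '10.0') found in `nvcc -V` output, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


sample = "Cuda compilation tools, release 10.0, V10.0.130"
print(parse_cuda_release(sample))  # prints 10.0
```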

To fully uninstall TensorFlow, I used the following commands.

# Fully uninstall tensorflow (I was running as root)
-bash-4.2# pip uninstall protobuf
-bash-4.2# pip uninstall tensorflow

Then I reinstalled TensorFlow with a plain pip install, without pinning a version.

Finally, I found a Keras handwritten-digit recognition script online to test with, and it worked!

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2020/12/15 17:28
# @Author  : Qiufen.Chen
# @Email   : 1760812842@qq.com
# @File    : test.py
# @Software: PyCharm
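import os
import tensorflow as tf
# Restrict this process to a single GPU (index 5 on the 8-GPU server)
gpu_id = '5'
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
os.system('echo $CUDA_VISIBLE_DEVICES')
# TensorFlow 1.x session config: allocate GPU memory on demand instead of all at once
tf_config = tf.ConfigProto()
tf_config.gpu_options.allow_growth = True
tf.Session(config=tf_config)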
from keras.models import Sequential
from keras.layers import Dense,Flatten
from keras.layers.convolutional import Conv2D, MaxPooling2D
from keras.utils.np_utils import to_categorical
from keras.datasets import mnist
from keras.utils import plot_model
# Load MNIST and add a trailing channel dimension: (N, 28, 28) -> (N, 28, 28, 1)
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(-1, 28, 28, 1)
X_test = X_test.reshape(-1, 28, 28, 1)
# One-hot encode the digit labels into 10 classes
y_train = to_categorical(y_train, num_classes=10)
y_test = to_categorical(y_test, num_classes=10)
print(X_train.shape, y_train.shape)
model = Sequential()
# Conv block 1: 6 filters of 3x3, followed by 2x2 max pooling
model.add(Conv2D(6, (3, 3), strides=(1, 1), input_shape=X_train.shape[1:], data_format='channels_last', padding='valid', activation='relu', kernel_initializer='uniform'))
model.add(MaxPooling2D((2, 2)))
# Conv block 2: 16 filters of 3x3, followed by 2x2 max pooling
model.add(Conv2D(16, (3, 3), strides=(1, 1), data_format='channels_last', padding='valid', activation='relu', kernel_initializer='uniform'))
model.add(MaxPooling2D((2, 2)))
# Conv layer 3: 120 filters of 5x5, then flatten for the dense layers
model.add(Conv2D(120, (5, 5), strides=(1, 1), data_format='channels_last', padding='valid', activation='relu', kernel_initializer='uniform'))
model.add(Flatten())
# Fully connected layer with 84 units
model.add(Dense(84, activation='relu'))
# Output layer: one softmax probability per digit
model.add(Dense(10, activation='softmax'))
model.summary()
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])
print("train____________")
model.fit(X_train, y_train, epochs=1, batch_size=128)
print("test_____________")
loss, acc = model.evaluate(X_test, y_test)
print("loss=", loss)
print("accuracy=", acc)

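Note: the lab server has eight GPUs, so always put the header code above at the top of your script. It restricts the process to one GPU and lets TensorFlow allocate memory on demand, which keeps your job from occupying cards it doesn't need. The nvidia-smi command shows current GPU usage; right now nobody else is using the server. Even so, to keep my footprint small, I set gpu_id to 5 in the code.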
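
Rather than hard-coding a GPU index, one could pick the least busy card from nvidia-smi output. The following is a sketch under stated assumptions: the helper names `query_gpu_memory` and `least_used_gpu` are hypothetical, and it relies on the `--query-gpu=index,memory.used` and `--format=csv,noheader,nounits` options of nvidia-smi, which print one "index, used MiB" pair per line.

```python
import subprocess


def query_gpu_memory():
    """Ask nvidia-smi for one 'index, memory_used_MiB' line per GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def least_used_gpu(csv_text):
    """Return the index of the GPU with the least memory in use, or None."""
    best_index, best_used = None, None
    for line in csv_text.strip().splitlines():
        index_str, used_str = [field.strip() for field in line.split(",")]
        used = int(used_str)
        if best_used is None or used < best_used:
            best_index, best_used = int(index_str), used
    return best_index


# Typical use on the server (needs the NVIDIA driver installed):
# os.environ["CUDA_VISIBLE_DEVICES"] = str(least_used_gpu(query_gpu_memory()))
```

Keeping the parsing separate from the subprocess call means the selection logic can be checked on a machine without a GPU.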

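One more issue: the header of the script above was written against the TensorFlow 1.x API, while the TensorFlow now installed on the system is 2.x, so it raises errors. I therefore changed that header to the snippet shown below.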

Error 1:

AttributeError: module 'tensorflow' has no attribute 'ConfigProto'

Error 2:

AttributeError: module 'tensorflow' has no attribute 'Session'

Error 3:

RuntimeError: set_session is not available when using TensorFlow 2.0.
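
The first two errors come from names such as `ConfigProto` and `Session` no longer being available at the top level of the `tensorflow` module in 2.x; they are still reachable under `tf.compat.v1`. The sketch below shows one way to look a name up in whichever place has it. The helper `v1_api` is hypothetical, and the stand-in objects exist only so the example runs without TensorFlow installed.

```python
from types import SimpleNamespace


def v1_api(tf_module, name):
    """Return tf.compat.v1.<name> if that namespace has it, else tf.<name>."""
    compat = getattr(tf_module, "compat", None)
    v1 = getattr(compat, "v1", None)
    if v1 is not None and hasattr(v1, name):
        return getattr(v1, name)
    return getattr(tf_module, name)


# Stand-ins for the two module layouts (not real TensorFlow objects)
fake_tf1 = SimpleNamespace(ConfigProto="tf1.ConfigProto")
fake_tf2 = SimpleNamespace(
    compat=SimpleNamespace(v1=SimpleNamespace(ConfigProto="v1.ConfigProto"))
)
print(v1_api(fake_tf1, "ConfigProto"))  # prints tf1.ConfigProto
print(v1_api(fake_tf2, "ConfigProto"))  # prints v1.ConfigProto
```

With a real installation the call would be `v1_api(tf, "ConfigProto")()` after `import tensorflow as tf`.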

In summary, changing the header of the script above to the following makes everything work.

import os
import tensorflow as tf
gpu_id = '5'
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
os.system('echo $CUDA_VISIBLE_DEVICES')
# In TensorFlow 2.x the 1.x-style ConfigProto and Session live under tf.compat.v1
tf_config = tf.compat.v1.ConfigProto()
tf_config.gpu_options.allow_growth = True
tf.compat.v1.Session(config=tf_config)
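
As an alternative to the compat shim, TensorFlow 2.x also offers a native way to request on-demand memory allocation. This is not what the original post used. It is a sketch that assumes TensorFlow 2.1 or later, where `tf.config.list_physical_devices` and `tf.config.experimental.set_memory_growth` are available. The helper name `enable_memory_growth` is hypothetical, and the module is passed in as an argument so the function can be exercised without a GPU.

```python
def enable_memory_growth(tf_module):
    """Turn on on-demand GPU memory allocation for every visible GPU.

    Must be called before TensorFlow initializes the GPUs,
    otherwise set_memory_growth raises a RuntimeError.
    Returns the number of GPUs that were configured.
    """
    gpus = tf_module.config.list_physical_devices("GPU")
    for gpu in gpus:
        tf_module.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)


# Typical use at the top of a script:
# import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "5"  # choose the card first
# import tensorflow as tf
# enable_memory_growth(tf)
```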

# ===========================================================

Haha, I came across a line online that I felt belonged at the end of this post.

[Image: the quoted line]