使用Colab考量及環境設定

由於Colab最長的執行時間為12小時，但訓練YOLO通常都長達數天以上，因此，在下方的步驟中，我們建立一個專用的Colab disk空間，每次重新執行Colab不會遺失訓練結果，且很快可以設定好訓練環境並從上次中斷的地方繼續訓練。
如果使用transfer learning，例如使用已訓練好的麵包weights來訓練新的麵包種類，那麼訓練時間可大幅縮短，僅需要數個小時到半天的時間，因此很適合使用Colab。
由於需要將dataset上傳到Google Drive，且訓練過程中會持續的產生weights檔（亦可設定多少次epochs產生一個weights），因此免費的Google Drive空間很快就會耗盡，您可能需要購買額外的空間，例如，每月NT $90可擴增到200GB。

下方的步驟示範如何使用Colab免費GPU來訓練YOLO。

環境設定：1.建立Colab專用的disk空間

在您的Google Drive，建立一個folder專門for Colab使用。下方例子中，我在最上層建了一個space_Colab。

接著，把你打算要訓練的dataset（PASCAL-VOC format）上傳到此目錄下。

環境設定：2. 將Colab加入javascript whitelist

Chrome：設定🡪網站設定🡪Javascript，將下列三個網域加入white list，讓Colab頁面可長時間持續的執行不會產生javascript error。

Colab的限制

如果你在Colab輸入下方的指令，會看到目前提供的GPU型號是Tesla P100而且還是16GB的版本。

Tesla P100發表於2016年，有3584個CUDA Core，首發價格＄5,699 USD，與其它各系列的NVidia GPU的performance比較,可參考此頁面：https://www.microway.com/knowledge-center-articles/comparison-of-nvidia-geforce-gpus-and-nvidia-tesla-gpus/。

雖然在高精度FP64上的表現仍獨佔鼇頭，但是較低精度的performance已比不上較新的GPU如2080Ti，但是用在深度學習的訓練上還是相當適合，尤其是16GB這麼大的記憶體可以讓我們訓練時能設定更大的training batch，加快訓練速度。

然而想要使用這免費又超強的GPU在Colab來訓練大型的dataset，首要解決的是兩個問題：

12小時的使用時間限制
Google Drive的讀取限制

針對第一點的時間限制，我們可以先在Google Drive建立一資料夾，將所有要訓練的dataset上傳到此空間，然後將該folder mapping到Colab來使用。當然除了dataset，訓練過程中會用到以及可能產生的檔案也須放置於該folder，當訓練時間超過了12小時時間限制，我們只要重新啟動該Colab頁面，便可讀取上次的weights檔繼續訓練下去。

至於第二點的Google Drive的檔案讀取限制，是指Colab持續讀取Google Drive檔案數目 (約在7,~8000左右)若太多，則會Time out並出現Input/Output error的訊息，例如下方，我打算list一個folder的檔案（該folder有15,000個檔案），就會出現error訊息。

解決方式是將這些檔案放散放到子目錄下，讓單一目錄的檔案數目不要過大，這樣讀取時便不會產生Time out error。

Colab與Local的路徑保持一致

如果我們把local環境配置得跟雲端的Colab一樣，那麼，我們就可以在local先產生需要的dataset和設定檔後，上傳Google Drive就能直接在Colab訓練，此外也能切換在Colab或本地端訓練或執行，增加兩者併用的方便性。

所謂環境的一致指的是Colab和本地端的檔案路徑，可以用soft link方式，當Colab map到 Google Drive的folder後，我們將它指向另一個path，讓path與本地端的path格式一樣。例如，本地端的dataset path為/WORK1/dataset，Colab存取Google Drive dataset的path也是/WORK1/dataset。在Colab端執行：

from google.colab import drive

drive.mount(‘/content/gdrive’, force_remount=True)

!ln -s ‘/content/gdrive/My Drive/space_Colab’ /WORK1

預設Google Drive mount到Colab的path為/content/gdrive/My Drive，但我們使用第三行的ln -s將/WORK1指向/content/gdrive/My Drive，因此接下來便能與Local端一樣使用相同的/WORK1/dataset路徑，便能存取/content/gdrive/My Drive/dataset。

實際案例操作：

在Colab使用官方版YOLO訓練CrowdHuman Dataset

Step 1：CrowdHuman Dataset下載及轉檔

請從http://www.crowdhuman.org/下載dataset，不需申請。下載解壓後其檔案架構如下。該dataset使用的並非我們熟悉的PASCAL VOC格式，可請參考我之前的文章將其轉檔為VOC格式(您也可以不轉換另外撰寫程式直接讀取其標記檔)。

原dataset架構：

轉檔後dataset架構：

Step 3：產生訓練用YOLO dataset並減少單一資料夾檔案數目

接下來將PASCAL dataset轉為YOLO dataset格式，作法請參考我之前的文章「如何快速完成yolo-v3訓練與預測」。

產生後for YOLO訓練用的圖片及標記都會放置於同一資料夾中，總數有30,000筆(15,000張jpg圖片, 15,000個標記txt檔)，由於數目太大會讓Colab在讀取時產生Time out的error，因此必須將這些檔案分散到子資料夾中。

請執行註一的程式，即可將這30,000筆資料分散到15個資料夾中，每個folder設定為2,000筆。

Step 4：產生訓練用的YOLO dataset及設定檔

此步驟要產生YOLO訓練時需要的train.txt、test.txt、obj.data、obj.names、YOLO cfg檔案等，作法請參考我之前的文章「如何快速完成yolo-v3訓練與預測」。

Step 5：將YOLO dataset及設定檔上傳Google Drive

前一步驟會有兩個資料夾，一個是YOLO dataset，一個是設定檔，都要上傳到Google Drive。

在Google Drive建立一個folder名稱為space_Colab（可自取）。
將YOLO dataset資料夾上傳到space_Colab。
將設定檔資料夾上傳到space_Colab。

Step 6：下載官方版Darknet到Google Drive

先增一Colab頁面，指定為使用GPU（Runtime 🡪 Change runtime type）

執行下方的指令，將Darknet程式下載到Google Drive。

import os

from google.colab import drive

drive.mount(‘/content/gdrive’, force_remount=True)

%cd “/content/gdrive/My Drive/space_Colab"

git clone https://github.com/pjreddie/darknet

下載完成後，修改darknet的Makefile，將參數修改為如下：

GPU=1

CUDNN=1

OPENCV=1

OPENMP=0

DEBUG=0

存檔後，進入darknet目錄下執行make。

%cd darknet

!make

Step 7：在Colab測試YOLO

開一新的Colab頁面，執行下列的程式：

#連接Google Drive

from google.colab import drive

drive.mount(‘/content/gdrive’, force_remount=True)

#指向/WORK1

!ln -s ‘/content/gdrive/My Drive/space_Colab’ /WORK1

#將darknet加入可執行權限

!ls -la /WORK1/darknet.official/darknet

!chmod 755 /WORK1/darknet.official/darknet

!ls -la /WORK1/darknet.official/darknet

#下載YOLO COCO預訓練weights

import requests, os

weights_filename = “/WORK1/cfg_YOLO/Pretrained/yolov3.weights"

weights_url = “https://pjreddie.com/media/files/yolov3.weights"

#顯示圖片用

def imShow(path):

import cv2

import matplotlib.pyplot as plt

%matplotlib inline

image = cv2.imread(path)

height, width = image.shape[:2]

resized_image = cv2.resize(image,(3*width, 3*height), interpolation = cv2.INTER_CUBIC)

fig = plt.gcf()

fig.set_size_inches(18, 10)

plt.axis(“off")

#plt.rcParams[‘figure.figsize’] = [10, 5]

plt.imshow(cv2.cvtColor(resized_image, cv2.COLOR_BGR2RGB))

plt.show()

if( not os.path.exists(weights_filename)):

#download to google drive

r = requests.get(weights_url, stream = True)

with open(weights_filename, “wb") as file:

for block in r.iter_content(chunk_size = 1024):

if block:

file.write(block)

#執行預測

!./darknet detect cfg/yolov3.cfg /WORK1/cfg_YOLO/Pretrained/yolov3.weights data/dog.jpg

由於Darknet在作detect時會嘗試display圖片並等待使用者click，故會等待一時間才會出現訊息，您可以修改examples/detector.c，comment下方 612~615之間內容。

執行結果如下，確認Darknet可正常的執行。

Step 8：開始訓練YOLO

新增一Colab頁面，可取名為train.ipynb，執行的程式如下：

from google.colab import drive

drive.mount(‘/content/gdrive’, force_remount=True)

!ln -s ‘/content/gdrive/My Drive/space_Colab’ /WORK1

!ls -la /WORK1/darknet.official/darknet

!chmod 755 /WORK1/darknet.official/darknet

%cd /WORK1/darknet.official

!./darknet detector train /WORK1/cfg_YOLO/cfg.crowdHuman_colab/obj.data /WORK1/cfg_YOLO/cfg.crowdHuman_colab/crowd_human_yolov3_colab.cfg /WORK1/cfg_YOLO/Pretrained/darknet53.conv.74

會看到Colab載入model後開始進行訓練了。

可能過了一段時間可能訓練log畫面沒有更新，但左上角的圓形還有在轉，表示還是有在run，可不用擔心。另外，也可同步在Google Drive上，看到訓練的weights有持續在增加及更新。（如下圖紅框部份）

超過了12小時Colab就會出現Runtime disconnected不再執行（如下圖），此時可重新載入train.ipynb頁面，將最後一行訓練中所帶入的pretrained weights改為xxxx.backup，重新執行一次，便可接續最近一次訓練結果繼續訓練下去。

!./darknet detector train /WORK1/cfg_YOLO/cfg.crowdHuman_colab/obj.data /WORK1/cfg_YOLO/cfg.crowdHuman_colab/crowd_human_yolov3_colab.cfg /WORK1/cfg_YOLO/cfg.crowdHuman_colab/weights/crowd_human_yolov3_colab.backup

此時可重新載入train.ipynb頁面，將最後一行訓練中所帶入的pretrained weights改為xxxx.backup，重新執行一次，便可接續最近一次訓練結果繼續訓練下去。

訓練結果

在經過數天斷斷續續的訓練後，使用Crowd Human dataset的15,000張圖片，透過Colab所訓練結果如下，但尚未訓練到最佳的loss，原因請見本文最後，不過大部份行人都能抓出來了

結論：能用Colab取代實體GPU嗎？

Colab的優點：正如下方這些您也能想到的：

免費且又在雲端，隨時隨地時可取用。
超大的GPU memory(16GB)，遠勝2080Ti，訓練YOLOV3時，batch可以設得更大加快訓練速度。
預設已安裝好常用的AI frameworks，進入Colab便能直接使用。
連接Google Drive相當方便，只要空間夠大，便可先預存大量的dataset備用。

Colab的缺點：

使用時間限制（12hrs）。
必須一直開啟著Web browser避免Colab視窗關閉。
在執行時很容易出現Javascript的warning或錯誤訊息，造成執行中斷。
將Dataset上傳到Google Drive的時間及空間等有形和無形成本。
Colab與Google Drive之間的讀取速度緩慢，拖累了GPU執行速度。

最後，回到現實，看完上方的介紹，您是否期待著用這免費Colab GPU來取代實體GPU？很遺憾，沒有辦法，至少現階段還不行，最主要原因是Google Colab使用辦法中的這條限制：

Why are hardware resources such as T4 GPUs not available to me?

The best available hardware is prioritized for users who use Colaboratory interactively rather than for long-running computations. Users who use Colaboratory for long-running computations may be temporarily restricted in the type of hardware made available to them, and/or the duration that the hardware can be used for. We encourage users with high computational needs to use Colaboratory’s UI with a local runtime.

Please note that using Colaboratory for cryptocurrency mining is disallowed entirely, and may result in being banned from using Colab altogether.

當您突破了12小時的限制，持續的重新執行Colab跑你的YOLO training，不消幾天，您的Colab頁面就會出現下方訊息：

F:\Temp\Downloads\49332663177_f3e86fe271_k.jpg

而且，您會在短時間內被封鎖無法再使用Colab的GPU了，因為只要一執行就會出現這訊息，只能切換為CPU才能正常運作，因為您被Google逮到將Colab用在長時間運算而非測試與學習的任務，並需等待短則半天多則數天之後才能再使用GPU。

至於這封鎖時間為期多久？坦白說我也不曉得，因為我也正在封鎖期，等解封後就能告訴你答案了。

註一：

import glob, os
import os.path
import shutil

#YOLO folder must has: *.jpg and *.txt
img_count_total = 40000  #more than real number is ok
source_image_type = ".jpg"
source_yololabel_type = ".txt"
file_count_in_folder = 1000
source_dataset = "/DATA1/Datasets_mine/labeled/crowd_human_dataset/yolo2/yolo"
target_dataset = "/WORK1/dataset/CrowdHuman_YOLO_10_folders"

if not os.path.exists(target_dataset):
  os.makedirs(target_dataset)

for loop_folder in range(int(img_count_total/file_count_in_folder)+1):
  print("Loop count #{}".format(loop_folder))
  for i, file in enumerate(glob.iglob(os.path.join(source_dataset, "*"+source_image_type))):
    if(i>=file_count_in_folder):
      break

    filename = os.path.basename(file)
    file_mainname, file_extension = os.path.splitext(filename)

    source_img_file = os.path.join(source_dataset, filename )
    source_txt_file = os.path.join(source_dataset, file_mainname + source_yololabel_type )

    new_folder = os.path.join(target_dataset, str(loop_folder))
    if not os.path.exists( new_folder ):
      os.makedirs(new_folder)

    target_img_file = os.path.join(new_folder, filename )
    target_txt_file = os.path.join(new_folder, file_mainname + source_yololabel_type )

    try:
        print("#{}/{} move {},{}...".format(loop_folder, i, filename, file_mainname + source_yololabel_type))
        shutil.move(source_img_file, target_img_file)
        shutil.move(source_txt_file, target_txt_file)
    except:
        print("#{}/{} move filed".format(loop_folder, i))
        continue

使用Google Colab訓練YOLO

使用Colab考量及環境設定

環境設定：1.建立Colab專用的disk空間

環境設定：2. 將Colab加入javascript whitelist

Colab的限制

Colab與Local的路徑保持一致