Carbonate issues, GPU availability, TensorFlow errors: Week 10 & Week 11

What I did this week

Recently, I was assigned an RP (Research Project) account on Indiana University Bloomington's HPC cluster, Carbonate. This dedicated account lets me access multiple GPUs for my experiments.

Once I configured my sbatch file accordingly, I ran into GPU access issues: my debug print statements revealed that TensorFlow was seeing only one GPU, despite the sbatch job requesting more than one. I double-checked my dataloader definition, my distribution strategy, and my train function, and read through IU's blogs as well as other resources online to see if I was missing something.
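To make the symptom concrete, here is a minimal sketch of the kind of check my debug prints performed; the sbatch directive in the comments is an illustrative assumption, not my exact job script:

```python
import tensorflow as tf

# The sbatch script requested multiple GPUs with something like:
#   #SBATCH --gres=gpu:2
# (illustrative directive, not my exact configuration)

# List the GPUs visible to this process; on Carbonate this kept
# returning a single device even though the job asked for more.
gpus = tf.config.list_physical_devices("GPU")
print("Visible GPUs:", gpus)

# MirroredStrategy should report one replica per visible GPU,
# so num_replicas_in_sync is a quick sanity check.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)
```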

Nothing worked. My mentor eventually asked me to raise an IT request for Carbonate, but the IT personnel couldn't help either. This could only mean that TensorFlow itself wasn't picking up the assigned GPUs. So, on my mentor's suggestion, I loaded an older version of the deeplearning module, 2.9.1 (I had been using 2.11.1 earlier). This worked!

This also meant using a downgraded version of TensorFlow (2.9), which brought a fresh round of errors: time-consuming, yet resolvable. To accommodate the older TensorFlow version, I made some architectural changes: I replaced GroupNorm with BatchNorm layers, and moved from a from_tensor_slices-based dataloader to a DataGenerator. Additionally, I had to change the model structure from a plain list of layers to a tensorflow.keras.Sequential model, with the input_shape defined in the first layer. Without this last change, I ran into None object errors.
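For concreteness, here is a hedged sketch of what those two changes look like; the layer sizes, shapes, and names are placeholders I picked for illustration, not my actual architecture:

```python
import numpy as np
import tensorflow as tf

# 1) A keras.utils.Sequence-style DataGenerator, standing in for the
#    tf.data.Dataset.from_tensor_slices pipeline I used on TF 2.11.
class DataGenerator(tf.keras.utils.Sequence):
    def __init__(self, x, y, batch_size=32):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        # Number of batches per epoch.
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        # Return one batch of (inputs, targets).
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.x[sl], self.y[sl]

# 2) A tf.keras.Sequential model with input_shape on the first layer,
#    instead of a plain Python list of layers.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(64, 64, 1)),
    tf.keras.layers.BatchNormalization(),  # BatchNorm in place of GroupNorm
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```

Declaring input_shape on the first layer lets Keras build the model's weights up front, which appears to be why the None object errors went away for me on TF 2.9.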

Once all my new code was in place, the week ended, hahahah. GPUs were also scarce on the cluster that same week. I'm glad I got some work done though.

What is coming up next week

Run more experiments!

Did I get stuck anywhere?

All I did was get stuck again & again :P But all is well now.