In the first part of the M1 benchmark article, I compared a MacBook Air M1 with an iMac 27″ Core i5, an 8-core Xeon(R) Platinum instance, a K80 GPU instance, and a T4 GPU instance on three TensorFlow models.
While the GPU was not as efficient as expected, perhaps because the very early version of TensorFlow was not yet fully optimized for the M1, the CPU results were very impressive, far outperforming both the iMac and the 8-core Xeon(R) Platinum instance.
While it’s usually unfair to compare an entry-level laptop with such high-end configurations, the M1 has one advantage: it has 8 physical cores.
The iMac 27″ Core i5 has 4 cores, and the 8-core Xeon(R) Platinum instance, like every cloud instance, counts vCPUs, so it only has 4 physical cores. I double-checked the physical IDs to make sure, following this paper.
In this article, I compare the M1 with more powerful configurations with 8 to 20 physical cores (16 to 40 hyper-threaded cores).
The AMD EPYC configurations are cloud instances, while the Xeon(R) Silver is a bare-metal machine, meaning a true dedicated physical server.
The Xeon(R) server has two CPUs with 10 cores (20 threads) each, for a total of 20 physical cores (40 threads).
It uses TensorFlow 2.3.1 to benefit from some compilation options. At startup it displays the following:
This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
I also compared it with the Intel(R) MKL build delivered with Anaconda; both show similar performance.
As a reminder (as shown in the previous article), here are the M1 specs:
- 8-core CPU (4 high-performance cores at 3.2 GHz, 4 high-efficiency cores at 2.06 GHz)
- 8-core GPU (128 execution units, 24,576 threads, 2.6 TFLOPS)
- 16-core Neural Engine dedicated to linear algebra
- Unified Memory Architecture at 4,266 MT/s (34,128 MB/s data transfer)
You can find the models and the dataset used in the previous article.
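The exact architectures and dataset are defined in that previous article; purely as an illustration of the three model families being benchmarked, here is a minimal Keras sketch. All layer sizes and input shapes below are placeholders of my own, not the benchmarked configurations:

```python
# Hypothetical sketches of the three benchmarked model families (MLP,
# LSTM, convnet). Layer sizes and input shapes are placeholders, not
# the ones actually used in the benchmark.
import tensorflow as tf
from tensorflow.keras import layers, models

def make_mlp(input_dim=32, n_classes=10):
    # Plain fully connected network
    return models.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(64, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])

def make_lstm(timesteps=20, features=8, n_classes=10):
    # Recurrent network: sequential matrix products at every timestep
    return models.Sequential([
        layers.Input(shape=(timesteps, features)),
        layers.LSTM(64),
        layers.Dense(n_classes, activation="softmax"),
    ])

def make_convnet(side=28, channels=1, n_classes=10):
    # Small image convnet
    return models.Sequential([
        layers.Input(shape=(side, side, channels)),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.Flatten(),
        layers.Dense(n_classes, activation="softmax"),
    ])
```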
The following plots show the results for training on CPU.
In CPU training, the MacBook Air M1 far exceeds the performance of the two AMD EPYC instances and of the 20/40-core Intel(R) Xeon(R) Silver on MLP and LSTM. Only the convnet gives a very small advantage to the Xeon(R) at batch sizes of 128 and 512, and the 16/32-core AMD EPYC is only slightly better at a batch size of 512.
The following plot shows how many times slower than the M1 CPU the other devices are.
For MLP and LSTM, the M1 is about 5 to 13 times faster than the 8/16-core AMD EPYC, 3 to 12 times faster than the 16/32-core AMD EPYC, and 2 to 10 times faster than the 20/40-core Xeon(R) Silver bare-metal server. For the CNN, the M1 is roughly 2.5 times faster than the others at a batch size of 32, 2 times faster than the AMD EPYC at the other batch sizes, and only 10% to 20% slower than the 20/40-core Xeon(R) Silver at batch sizes of 128 and 512.
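These ratios are simply each device's training time divided by the M1's time for the same model and batch size. A minimal sketch of that computation; the device names and timings below are made up for illustration, not the measured benchmark values:

```python
import time

def time_training(train_step, n_batches):
    """Time a training run by calling train_step once per batch."""
    start = time.perf_counter()
    for _ in range(n_batches):
        train_step()
    return time.perf_counter() - start

def slowdown_vs_reference(times, reference="M1"):
    """How many times slower each device is than the reference device."""
    ref = times[reference]
    return {device: t / ref for device, t in times.items()}

# Made-up example timings in seconds (not the measured benchmark values)
times = {"M1": 10.0, "EPYC 8/16": 55.0, "Xeon Silver 20/40": 30.0}
print(slowdown_vs_reference(times))
# {'M1': 1.0, 'EPYC 8/16': 5.5, 'Xeon Silver 20/40': 3.0}
```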
Let’s check the CPU consumption for the Xeon(R) during training.
Half of the cores are loaded at about 70%, which means hyper-threading is useless in this case: only the physical cores are used.
Now let’s check the M1 CPU history during the whole benchmark.
Surprisingly, it shows that only 4 cores are really used, and only at 50%, during MLP and CNN training. The maximum load occurs during LSTM training; this is also the only case where the 4 other cores are loaded up to 50%. We can assume the most loaded cores are the “high-performance” cores at 3.2 GHz.
So how can the M1, while only partially using half of its cores, achieve such superior performance, far outpacing a 20/40-core Xeon(R) Silver running a TensorFlow build compiled with AVX-512 instructions? The M1 “performance” cores run at a frequency 50% higher than the Xeon(R)’s, but the latter has 20 physical cores, so the frequency difference is not sufficient to explain such a performance gap.
Why is the M1 so fast?
As Apple never discloses the details of its processor designs, it is difficult to know how the different parts of the M1 are really used. Still, we can try to formulate some hypotheses.
- The M1 CPU includes “ML Accelerators”, as mentioned here; they are used by the ML Compute BNNS primitives when training neural network models on CPU. If we refer to this article, we can suppose that “ML Accelerators” is the generic name for AMX2 (Apple Matrix Coprocessor), so it is perhaps more like a mini NPU capable of accelerating linear algebra, and especially matrix multiplications. But unlike the Neural Engine, it is built to work with the CPU instead of being a standalone component, reducing latencies and improving throughput for intensive matrix computing. This may explain why the maximum gap is observed on the LSTM, as this type of unit is not only vector-based but requires much more iterative processing. The combination of a CPU and a high-throughput matrix computing unit seems very efficient for such models
- The Unified Memory design enables much faster data access for each core and every component of the SoC, coupled with a better CPU design that enables more efficient parallel processing
Again, these are hypotheses, and only Apple can explain how it really works.
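To illustrate why a matrix unit tightly coupled to the CPU would help the LSTM most: unlike an MLP forward pass, a recurrent step depends on the previous hidden state, so its matrix multiplications must run sequentially, one chain per timestep. A toy NumPy sketch of that dependency (shapes and values are arbitrary, not from the benchmark):

```python
import numpy as np

def lstm_like_steps(x, W, U, h0):
    """Toy recurrent loop: each timestep's matmuls depend on the
    previous hidden state, so the work cannot be batched into one
    big matrix product -- low-latency matmul hardware pays off."""
    h = h0
    for t in range(x.shape[0]):          # sequential over timesteps
        h = np.tanh(x[t] @ W + h @ U)    # two matmuls per step
    return h

rng = np.random.default_rng(0)
T, d, n = 20, 8, 64                      # timesteps, input dim, hidden dim
x = rng.standard_normal((T, d))
W = rng.standard_normal((d, n))
U = rng.standard_normal((n, n))
h = lstm_like_steps(x, W, U, np.zeros(n))
print(h.shape)   # (64,)
```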
From these tests, it appears that:
- for training MLP and LSTM, the M1 CPU is by far faster than all the high-end servers tested
- for training CNN, the M1 is only slightly slower than the high-end servers tested
Of course, these metrics only hold for neural network types and depths similar to those used in this test.
For big trainings and intensive computing lasting more than 20 minutes, I will still go for cloud-based solutions, as they provide cards built for such long, heavy loads and allow sending several jobs simultaneously. But this scenario covers only some specific research representing about 10% of my work, mostly for professional usage in specific business areas.
As a machine learning engineer, for my day-to-day personal research, an M1 Mac is clearly the best and most cost-efficient option today.
Thank you for reading.