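On the morning of December 9, John Hennessy, chairman of the board of Google parent Alphabet, 2017 Turing Award laureate, and former president of Stanford University, delivered a speech at the 2021 T-EDGE Global Innovation Conference, co-hosted by TMTPost Group together with the Daxing Industry Promotion Center and the National New Media Industry Base.
In the speech, he voiced concern about the contradiction between overheated enthusiasm for AI and slowing technical progress. Hennessy noted that artificial intelligence (AI) has existed for roughly 60 years. Although deep learning has achieved breakthroughs in areas such as AlphaFold protein structure prediction, autonomous driving, large data centers, and cloud computing, he observed that AI progress has begun to slow, at a pace far below many early predictions.
In the semiconductor industry, after early growth of about 25% a year and the later arrival of multicore processors, performance has improved by less than 5% a year over the past two years, and even multicore designs have not significantly improved energy efficiency.
Hennessy said that we are now in a dilemma: the new technology we invented, deep learning, seems able to solve many problems effectively, but it needs massive amounts of computing power to advance. At the same time, with the end of Dennard scaling on one side and the slowing of Moore's law on the other, we can no longer expect each new generation of semiconductor technology to deliver a leap in performance.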
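Specifically, achieving larger performance improvements requires new architectural approaches that use integrated circuit capabilities more efficiently. There are three possible directions:
1. Software-centric mechanisms, which improve the efficiency of software so that it makes better use of the hardware;
2. Hardware-centric approaches, also called domain-specific architectures or domain-specific accelerators;
3. A combination of the two: developing languages matched to these specific architectures so that people can build applications more effectively.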
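Given these changes, Hennessy believes that general-purpose processors will no longer be the main force driving the industry, and that domain-specific processors working closely with software will gradually play a major role. As a result, we may see a more vertical industry, with tighter vertical integration between the developers of deep learning and machine learning models and the developers of operating systems and compilers, so that their programs can run efficiently, train efficiently, and be deployed in real use.
Hennessy stressed that we have come to an interesting time in the computing industry. We are optimizing in a different way than in the past; it will be a slow process, but it will surely change the entire computer industry. This does not mean that general-purpose processors will disappear, or that companies making multi-platform software will disappear. Rather, the industry will have a whole new driver, created by the dramatic breakthroughs we have seen in deep learning and machine learning. That will make for a very interesting next 20 years.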
The following is a transcript of John Hennessy's speech, edited by TMTPost:
Hello, I'm John Hennessy, professor of computer science and electrical engineering at Stanford University, and co-winner of the Turing Award in 2017.
大家好,我是約翰·軒尼詩,斯坦福大學計算機科學與電氣工程教授,也是2017 年圖靈獎共同獲得者。
It's my pleasure to participate in the 2021 T-EDGE conference.
很高興能參加 2021年的 T-EDGE 大會。
Today I'm going to talk to you about the trends and challenges in deep learning and semiconductor technologies, and how these two technologies, one a critical building block for computing and the other an incredible new breakthrough in how we use computers, are interacting and conflicting, and how they might go forward.
今天我想談談深度學習和半導體技術領域的趨勢和挑戰、這兩種技術需要的關鍵突破、以及計算機領域的其他重大突破和發展方向。
AI has been around for roughly 60 years, and for many years it continued to make progress but at a slow rate, much lower than many of the early prophets of AI had predicted.
人工智能技術已經存在大約 60 年,多年來持續發展。但是人工智能技術的發展開始放緩,發展速度已遠低許多早期的預測。
And then there was a dramatic breakthrough around deep learning. There were several smaller examples, but certainly AlphaGo defeating the world's Go champion at least ten years before it was expected was a dramatic breakthrough. It relied on deep learning technologies, and it exhibited what even professional Go players would say was creative play.
在深度學習上我們實現了重大突破。最出名的例子應該就是 AlphaGo 打敗了圍棋世界冠軍,這個成果要比預期早了至少十年。Alpha Go使用的就是深度學習技術,甚至連專業人類棋手也夸贊Alpha Go的棋藝頗具創意。
That was the beginning of a world change.
這是巨變的開端。
Today we've seen many other deep learning breakthroughs where deep learning is being used for complex problems. It is obviously crucial for image recognition, which enables self-driving cars. It is becoming more and more useful in medical diagnosis, for example, looking at images of skin to tell whether or not a lesion is cancerous. And there are applications in natural language, particularly around machine translation.
今天,深度學習也在其他領域取得重大突破,被應用于解決復雜的問題。其中最明顯的自然是圖像識別技術,它讓自動駕駛技術成為可能。圖像識別技術在醫學診斷中也變得越來越有用,可通過查看皮膚圖像判斷是否存在癌變。除此之外,還有在自然語言處理中的應用,尤其是在機器翻譯方面頗具成果。
For Latin-based languages, machine translation is now basically as good as professional translators, and it is improving constantly for Chinese to English, a much more challenging translation problem where we are nevertheless seeing significant progress.
目前,拉丁語系的機器翻譯基本上能做到和專業翻譯人員相似的質量。在更具挑戰的漢英翻譯方面上,機器翻譯也有不斷改進,我們已經能看到顯著的進步。
Most recently we've seen AlphaFold 2, DeepMind's approach to using deep learning for protein folding, which advanced the field by at least a decade in terms of what is doable in applying this technology to biology, and which is going to dramatically change the way we do drug discovery in the future.
近期我們也有 AlphaFold 2,一種使用深度學習進行蛋白質結構預測的應用,它將深度學習與生物學進行結合,讓該類型的應用進步了至少十年,將極大程度地改變藥物研發的方式。
What drove this incredible breakthrough in deep learning? Clearly the technology concepts have been around for a while, and in fact in many cases had been discarded earlier.
是什么讓深度學習取得了以上突破?顯然,這些技術概念已經存在一段時間了,在某種程度上也曾被拋棄過。
So why was it able to make this breakthrough now?
那么為什么現在我們能夠取得突破呢?
First of all, we had massive amounts of data for training. The Internet is a treasure trove of data that can be used for training. ImageNet was a critical tool for training image recognition. Today, close to 100,000 objects are on ImageNet and more than 1000 images per object, enough to train image recognition systems really well. So that was the key.
首先是我們有了大量的數據用于訓練AI。互聯網是數據的寶庫。例如 ImageNet ,就是訓練圖像識別的重要工具。現在ImageNet 上有近 100,000 種物體的圖像,每種物體有超過 1000 張圖像,這足以讓我們很好地訓練圖像識別系統。這是重要變化之一。
Obviously we have lots of other data we're using here. Whether it's protein folding, medical diagnosis, or natural language, we're relying on data that's available on the Internet and that's been accurately labeled to be used for training.
我們當然也使用了其他大量的數據,無論是蛋白質結構、醫學診斷還是自然語言處理方面,我們都依賴互聯網上的數據。當然,這些數據需要被準確標記才能用于訓練。
Second, we were able to marshal massive computational resources, primarily through large data centers and cloud-based computing. Training takes hours and hours using thousands of specialized processors. We simply didn't have this capability earlier. So that was crucial to solving the training problem.
第二,大型數據中心和云計算給我們帶來了大量的運算資源。使用數千個專用處理器進行人工智能訓練只需要數小時就能完成,我們之前根本沒有這種能力。因此,算力也是一個重要因素。
I want to emphasize that training is the computationally intensive problem here. Inference is much simpler by comparison. Here you see the rate of growth of performance demand, in petaflop/s-days, needed to train a series of models. Training AlphaZero, for example, requires 1,000 petaflop/s-days, roughly a week on the largest computers available in the world.
我想強調的是,人工智能訓練帶來的問題是密集的算力需求,程序推理變得簡單得多。這里展示的是訓練人工智能模型的性能需求增長率。以訓練 AlphaZero 為例,它需要 1000 pfs-day,也就是說用世界上最大規模的計算機來訓練要用上一周。
This demand has actually been growing faster than Moore's law. So the demand is going up faster than what semiconductors ever delivered, even in the very best era. We've seen a 300,000-times increase in compute from training simple models like AlexNet up to AlphaGo Zero, and new models like GPT-3 have billions of parameters that need to be set. So the training and the amount of data they have to look at are truly massive. And that's where the real challenge comes.
這個增長率實際上比摩爾定律還要快。因此,即使在半導體行業最鼎盛的時代,需求的增長速度也比半導體生產的要快。從訓練 AlexNet 這樣的簡單模型到 AlphaGo Zero,以及 GPT-3 等新模型,有數十億個參數需要進行設定,算力已經增加了 300,000 倍。這里涉及到的數據量是真的非常龐大,也是我們需要克服的挑戰。
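To put the two figures quoted above into more familiar units, here is a minimal arithmetic sketch. It only uses numbers from the talk (1,000 petaflop/s-days, a 300,000-fold increase, a two-year Moore's-law doubling period); treating "roughly a week" as exactly 7 days is an assumption made for illustration, and the function names are hypothetical.

```python
import math

# Figures quoted in the talk.
ALPHAZERO_PFS_DAYS = 1000      # petaflop/s-days to train AlphaZero
COMPUTE_GROWTH = 300_000       # AlexNet -> AlphaGo Zero training compute
MOORE_YEARS_PER_DOUBLING = 2   # "double every two years"

# Assumption for illustration only: "roughly a week" is taken as 7 days.
TRAINING_DAYS = 7


def sustained_petaflops(pfs_days, days):
    """Average petaflop/s needed to finish the job in the given number of days."""
    return pfs_days / days


def total_operations(pfs_days):
    """Total floating-point operations: 1e15 ops/s for 86,400 s per pfs-day."""
    return pfs_days * 1e15 * 86_400


def doublings(factor):
    """How many times a quantity must double to grow by `factor`."""
    return math.log2(factor)


def moore_years(factor, years_per_doubling=MOORE_YEARS_PER_DOUBLING):
    """Years a Moore's-law-style doubling rate would need to deliver `factor`."""
    return doublings(factor) * years_per_doubling


print(f"sustained rate: {sustained_petaflops(ALPHAZERO_PFS_DAYS, TRAINING_DAYS):.1f} petaflop/s")
print(f"total operations: {total_operations(ALPHAZERO_PFS_DAYS):.2e}")
print(f"doublings in 300,000x: {doublings(COMPUTE_GROWTH):.1f}")
print(f"years at one doubling per two years: {moore_years(COMPUTE_GROWTH):.1f}")
```

In other words, a 300,000-fold increase is about 18 doublings, which a two-year doubling period would take roughly 36 years to deliver.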
Moore's law, the version that Gordon Moore gave in 1975, predicted that semiconductor density would continue to grow quickly and basically double every two years, but we began to diverge from that. The divergence really began around 2000, and the spread is growing even wider. As Gordon said at the 50th anniversary of the first prediction: no exponential is forever. Moore's law is not a theorem or something that definitely must hold true. It's an ambition which the industry was able to focus on and keep on target. If you look at this curve, you'll notice that over roughly 50 years we fell short by only a factor of 15 while gaining a factor of almost 10,000.
摩爾定律,即戈登摩爾在 1975 年給出的版本,預測半導體密度將繼續快速增長,基本上每兩年翻一番,但我們開始偏離這一增長速度。偏離在2000 年左右出現,并逐步擴大。戈登在預測后的五十年后曾說道:沒有任何的物理事物可以持續成倍改變。當然,摩爾定律不是定理或必須成立的真理,它是半導體行業的一個目標。仔細觀察這條曲線,你會注意到在大約 50 年中,我們僅偏離了約 15 倍,但總共增長了近 10,000 倍。
So we've largely been able to keep on this curve, but we began diverging. When you factor in the increasing cost of new fabs and new technologies, and you convert this curve to price per transistor, you see it is not dropping nearly as fast as it once fell.
所以我們基本上能夠維持在這條曲線上,但我們確實開始跟不上了。如果你考慮到新晶圓廠和新技術的成本增加,當它轉換為每個晶體管的價格時,你會看到這條曲線的下降速度不像曾經下降的那么快。
We have also faced another problem, which is the end of so-called Dennard scaling. Dennard scaling is an observation made by Robert Dennard, the inventor of the DRAM that is ubiquitous in computing technology. He observed that as dimensions shrank, so would the voltage and other parameters, for example. And that would result in nearly constant power per square millimeter of silicon. Because the number of transistors in each square millimeter was going up dramatically from one generation to the next, that meant that power per computation was actually dropping quite quickly. That really came to a halt around 2007, and you see this red curve, which was going up slowly at the beginning, between 2000 and 2007 really began to take off. That meant that power was really the key issue, and figuring out how to get energy efficiency would become more and more important as these technologies went forward.
我們還面臨另一個問題,即所謂的登納德縮放定律。登納德縮放定律是由羅伯特·登納德 領導的一項觀察實驗,他是DRAM的發明人。據他的觀察,隨著尺寸縮小,電壓和其他共振也會縮小,這將導致每毫米硅的功率幾乎恒定。這意味著由于每一毫米中的晶體管數量從一代到下一代急劇增加,每個計算的功率實際上下降得非常快。這在 2007 年左右最為明顯,在 2000 年到 2007 年間開始緩慢上升的功耗開始激增。這意味著功耗確實是關鍵問題,隨著這些技術的發展,弄清楚如何獲得更高的能源效率將變得越來越重要。
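The scaling argument above can be sketched with the standard first-order model of dynamic power, P ≈ C·V²·f. That formula and the scaling factor `k` are textbook assumptions brought in for illustration; they are not spelled out in the talk. Under classic Dennard scaling, capacitance and voltage shrink by 1/k while frequency rises by k, so power per unit area stays constant; once voltage stops scaling, power density grows.

```python
def scaled_power_density(k, voltage_scales=True):
    """First-order model: shrink linear dimensions by a factor of k.

    Returns (power_per_transistor, area_per_transistor, power_density),
    all relative to the previous generation (1.0 = unchanged).
    """
    capacitance = 1.0 / k                         # smaller devices, less capacitance
    voltage = 1.0 / k if voltage_scales else 1.0  # Dennard: voltage shrinks too
    frequency = k                                 # shorter gates switch faster
    area = 1.0 / k ** 2                           # k^2 more transistors per mm^2

    power = capacitance * voltage ** 2 * frequency  # dynamic power ~ C * V^2 * f
    return power, area, power / area


# Classic Dennard scaling: power density is unchanged.
print(scaled_power_density(2.0, voltage_scales=True))   # (0.25, 0.25, 1.0)

# Voltage no longer scales: power density rises by k^2.
print(scaled_power_density(2.0, voltage_scales=False))  # (1.0, 0.25, 4.0)
```

The second case is the situation the talk describes after about 2007: the same silicon area has to dissipate more power each generation unless the design becomes more energy efficient.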
The combined result of this is that we've seen a leveling off of uniprocessor performance, single-core performance. The industry went through rapid growth in its early period of roughly 25% a year, then a remarkable period with the introduction of RISC technologies and instruction-level parallelism of over 50% a year, and then a slower period which focused very much on multicore and building on these technologies.
在經歷了行業早期每年大約 25% 的增長之后,隨著 RISC 技術的引入和指令級并行技術的出現,開始有每年超過 50% 的性能增長。之后我們就迎來了多核時代,專注于在現有技術上進行深耕。
In the last two years, we have seen less than 5% improvement in performance per year. Even if you look at multicore designs, with the inefficiencies that come with them, you see that they don't significantly improve things.
在過去的兩年中,每年的性能提升不到 5%,即使多核設計也沒有顯著改善能效方面的問題。
And indeed we are in the era of dark silicon, where multicore processors often slow down or shut off a core to prevent overheating, and that overheating comes from power consumption.
事實上,我們正處于半導體寒冬。多核處理器還是會因為擔心過熱而限制自身的性能。而過熱的問題就來自功耗。
So what are we going to do? We're in this dilemma where we've got a new technology, deep learning, which seems able to do problems that we never thought we could do quite effectively. But it requires massive amounts of computing power to go forward, and at the same time the slowing of Moore's law and the end of Dennard scaling are creating a squeeze on the ability of the industry to do what it relied on for many years, namely just get the next generation of semiconductor technology and everything gets faster.
那么我們能做什么呢?我們在這里陷入了兩難境地,我們擁有一項新技術,深度學習,它似乎能夠高效地解決很多問題,但同時它需要大量的算力才能進步。同時,一邊我們有著登納德縮放定律,一邊有著摩爾定律,我們再也不能期待半導體技術的更新迭代能給我們帶來飛躍的性能增長。
So we have to think about a new solution. There are three possible directions to go.
因此,我們必須考慮新的解決方案。這里有三個可能的方向。
Software-centric mechanisms, where we look at improving the efficiency of our software so it makes more efficient use of the hardware, in particular given the move to scripting languages such as Python, for example, that are dynamically typed. They make programming very easy, but they're not terribly efficient, as you will see in just a second.
以軟件為中心的機制。我們著眼于提高軟件的效率,以便更有效地利用硬件,特別是腳本語言,例如 python。這些語言讓編程變得非常簡單,但它們的效率并不高,接下來我會詳細解釋。
Hardware-centric approaches. Can we change the way we think about the architecture of these machines to make them much more efficient? This approach is called domain-specific architectures or domain-specific accelerators. The idea is to do just a few tasks, but to tune the hardware to do those tasks extremely well. We've already seen examples of this in graphics, for example, or the modem that's inside your cell phone. Those are special-purpose architectures that use intensive computational techniques but are not general purpose. They are not programmed for arbitrary things. They are designed to do a range of graphics operations or the operations required by a modem.
以硬件為中心的方法。我們能否改變我們對硬件架構的設計,使它們更加高效?這種方法稱為特定領域架構或特定領域加速器。這里的設計思路是讓硬件做特定的任務,然后優化要非常好。我們已經在圖形處理或手機內的調制解調器中看到了這樣的例子。這些使用的是密集計算技術,不是用于通用運算的,這也意味著它們不是設計來做各種各樣的運算,它們旨在進行圖形操作的安排或調制解調器需要的運算。
And then of course some combination of these. Can we come up with languages which match these new domain-specific architectures? Domain-specific languages improve the efficiency and let us code a range of applications very effectively.
最后是以上兩類的一些結合。我們是否能開發出與這些特定架構相匹配的語言?特定領域語言可以提高效率,讓我們非常有效地開發應用程序。
This is a fascinating slide from a paper by Charles Leiserson and his colleagues at MIT, published in Science, called "There's Plenty of Room at the Top."
這是查理·雷瑟森和他在麻省理工學院的同事完成發表在《科學》雜志上的一篇論文內容。論文名為“頂端有足夠的空間”。
What they observe is that software inefficiency, and the inefficiency of matching software to hardware, means that we have lots of opportunity to improve performance. They took admittedly a very simple program, matrix multiply, written initially in Python, and ran it on an 18-core Intel processor. Simply by rewriting the code from Python to C they got a factor of 47 in improvement. Then introducing parallel loops gave them another factor of approximately eight.
他們想要觀察的是軟件效率,以及軟件與硬件匹配過程中帶來的低效率,這也意味著我們有很多提高效率的地方。他們在 18 核英特爾處理器上運行了一個用 Python 編寫的簡單程序。把代碼從 Python 重寫為 C語言之后,他們就得到了 47 倍的效率改進。引入并行循環后,又有了大約 8 倍的改進。
Then they introduced memory optimizations. If you're familiar with large-scale matrix multiply, by doing it in blocked fashion you can dramatically improve the ability to use the caches effectively, and thereby they got another factor of a little under 20, about 15. And then finally, using the vector instructions inside the Intel processor, they were able to gain another factor of 10. Overall, this final program runs more than 62,000 times faster than the initial Python program.
引入內存優化后可以顯著提高緩存的使用效率,然后就又能獲得15~20倍的效率提高。然后最后使用英特爾處理器內部的向量指令,又能夠獲得10 倍的改進。總體而言,這個最終程序的運行速度比最初的 Python 程序快62,000 多倍。
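The blocked ("tiled") matrix multiply mentioned above can be sketched as follows. This is a minimal pure-Python illustration of the loop restructuring only, not the code from the paper; the matrix size, block size, and function names are hypothetical, and in an interpreted language the blocking will not reproduce the cache benefit that the C version gets. The last lines simply multiply the rounded per-step factors quoted in the talk.

```python
import random


def matmul_naive(a, b):
    """Textbook triple loop: C[i][j] = sum over k of A[i][k] * B[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, block=4):
    """Same product, computed tile by tile so each tile is reused while it is resident."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                # Work on one block x block tile of A, B, and C at a time.
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, m)):
                        a_ik = a[i][k]
                        for j in range(jj, min(jj + block, p)):
                            c[i][j] += a_ik * b[k][j]
    return c


random.seed(0)
SIZE = 10  # hypothetical size, deliberately not a multiple of the block size
A = [[random.randint(-9, 9) for _ in range(SIZE)] for _ in range(SIZE)]
B = [[random.randint(-9, 9) for _ in range(SIZE)] for _ in range(SIZE)]
same_result = matmul_naive(A, B) == matmul_blocked(A, B, block=4)

# Rounded per-step factors as quoted in the talk:
# Python -> C, parallel loops, blocking, vector instructions.
STEP_FACTORS = [47, 8, 15, 10]
combined_speedup = 1
for factor in STEP_FACTORS:
    combined_speedup *= factor

print(same_result, combined_speedup)
```

The rounded factors multiply to 56,400, the same order of magnitude as the "more than 62,000" overall figure; the gap comes from the rounding of each step as quoted.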
Now this is not to say that you would get this for larger-scale programs or all kinds of environments, but it's an example of how much inefficiency there is, at least for one simple application. Of course, not many performance-sensitive things are written in Python, but even the improvement from C to the fully parallel version of C that uses SIMD instructions is similar to what you would get if you used a domain-specific processor. It is significant in its own right. That's nearly a factor of 100, more than 100; it's almost 150.
當然,這并不是說在更大規模的程序或所有環境下我們都可以取得這樣的提升,但它是一個很好的例子,至少能說明一個簡單的應用程序也有效率改進空間。當然,沒有多少性能敏感的程序是用 Python 寫的。但從完全并行、使用SIMD 指令的C語言版本程序,它能獲得的效率提升類似于特定領域處理器。這已經是很大的性能提升了,這幾乎是 100 的因數,超過 100,幾乎是 150。
So there are lots of opportunities here, and that's the key point behind this slide.
所以提升空間是很多的,這個研究的發現就是如此。
So what are these domain-specific architectures? They are architectures that achieve higher efficiency by tailoring the architecture to the characteristics of the domain.
那么特定領域架構是什么呢?這些架構能讓架構掌握特定領域的特征來實現更高的效率。
We're not trying to do just one application; we're trying to do a domain of applications, like deep learning, for example, or computer graphics, or virtual reality applications. So it's different from a strict ASIC that is designed to do only one function, like a modem, for example.
我們在做的不只是一個應用程序,而是在嘗試做一個應用程序領域,比如深度學習,例如像虛擬現實、圖形處理。因此,它不同于ASIC,后者設計僅具有一個功能,就例如調制解調器。
It requires more domain-specific knowledge. So we need to have a language which conveys important properties of the application that are hard to deduce if we start with a low-level language like C. This is a product of codesign. We design the applications and the domain-specific processor together, and that's critical to get these to work together.
它需要更多特定領域的知識。所以我們需要一種語言來傳達應用程序的重要屬性,如果我們從像 C 這樣的語言開始就很難推斷出這些屬性。這是協同設計的產物。我們一起設計應用程序和特定領域的處理器,這對于讓它們協同工作至關重要。
Notice that these are not going to be things on which we run general-purpose applications. It's not the intention that we take 100 C programs. It's the intention that we take an application designed to be run on that particular DSA, and we use a domain-specific language to convey to the processor the information about the application that it needs to get significant performance improvements.
請注意,這不是用來運行通用軟件的。我們的目的不是要能夠運行100 個 C 語言程序。我們的目的是讓應用程序設計在特定的 DSA 上運行,我們使用特定領域的語言將應用程序的信息傳達給處理器,從而獲得顯著的性能提升。
The key goal here is to achieve higher efficiency both in the use of power and of transistors. Remember, those are the two limiters: the rate at which transistor growth is going forward, and the issue of power from the lack of Dennard scaling. So we're trying to really improve the efficiency of that.
這里的關鍵目標是在功率和晶體管方面實現更高的效率。請記住,晶體管增長的速度和登納德縮放定律是兩個限制因素,所以我們正在努力提高效率。
Good news? The good news here is that deep learning is a broadly applicable technology. It's the new programming model: programming with data rather than writing massive amounts of highly specialized code. We use data to train a deep learning model to detect that kind of specialized circumstance in the data.
有什么好消息嗎?好消息是深度學習是一種廣泛適用的技術。這是一種新的編程模型,使用數據進行編程,而不是編寫大量高度專業化的代碼,而是使用數據訓練深度學習模型來發現數據中的特殊情況。
And so we have a good target domain here. We have applications which really demand massive increases in performance, and for which we think there are appropriate domain-specific architectures.
所以我們有一個很好的目標域,我們有一些真正需要大量性能提升的應用程序,因此我們認為是有合適的特定領域架構的。
It's important to understand why these domain-specific architectures can win. In particular, there's no magic here.
我們需要弄明白這些特定領域架構的優勢。
People who are familiar with the books Dave Patterson and I co-authored know that we believe in quantitative analysis and an engineering, scientific approach to designing computers. So what makes these domain-specific architectures more efficient?
熟悉大衛·帕特森和我合著的書籍的人都知道,在計算機設計上,我們信奉遵循工程學方法論的量化分析。那么是什么讓這些特定領域架構更高效呢?
First of all, they use a simple model for parallelism that works in a specific domain, and that means they can have less control hardware. So, for example, we switch from the multiple instruction, multiple data model in a multicore to a single instruction, multiple data model. That means we dramatically improve the energy efficiency associated with fetching instructions, because now we have to fetch one instruction rather than n instructions.
首先,他們使用一個簡單的并行模型,在特定領域工作,這意味著它們可以擁有更少的控制硬件。例如,我們從多核中的多指令多數據模型切換到單指令數據模型。這意味著我們顯著提高了與獲取指令相關的效率,因為現在我們必須獲取一條指令而不是任何指令。
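The instruction-fetch argument above can be made concrete with a toy counting model. This is only an illustrative sketch: it assumes every lane performs the same sequence of operations and counts nothing but instruction fetches; the lane count and function name are hypothetical.

```python
def instruction_fetches(model, lanes, ops_per_lane):
    """Count instruction fetches needed to run `ops_per_lane` operations on each lane.

    MIMD: every core fetches its own instruction stream.
    SIMD: one fetched instruction drives all lanes at once.
    """
    if model == "MIMD":
        return lanes * ops_per_lane
    if model == "SIMD":
        return ops_per_lane
    raise ValueError(f"unknown model: {model}")


LANES = 256   # hypothetical number of parallel lanes
OPS = 1000    # hypothetical number of operations each lane performs

mimd = instruction_fetches("MIMD", LANES, OPS)
simd = instruction_fetches("SIMD", LANES, OPS)
print(mimd, simd, mimd // simd)  # fetch traffic shrinks by the number of lanes
```

In this simplified model the saving in fetches equals the number of lanes, which is the "one instruction rather than n instructions" point made in the talk.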
We move to VLIW rather than speculative out-of-order mechanisms, so things that rely on being able to analyze the code better, know about dependences, and therefore create and structure the parallelism at compile time rather than having to do it dynamically at runtime.
我們來看看VLIW和推測性亂序機制的對比。現在需要更好處理代碼的也能夠得知其依附性,因此能夠在編譯時創建和構建并行性,而不必進行動態運行。
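To illustrate what "structuring parallelism at compile time" can mean, here is a minimal sketch of a greedy scheduler that packs independent operations into fixed-width bundles, in the spirit of a VLIW compiler. It is a toy under stated assumptions: every operation takes one cycle, dependences are given explicitly, and the operation names and issue width are hypothetical.

```python
def schedule_bundles(ops, deps, width):
    """Greedily pack operations into bundles of at most `width` independent ops.

    ops:   operation names in program order
    deps:  dict mapping an op to the set of ops whose results it needs
    Returns a list of bundles; all ops in one bundle can issue in the same cycle.
    """
    done = set()
    remaining = list(ops)
    bundles = []
    while remaining:
        bundle = []
        for op in remaining:
            if len(bundle) == width:
                break
            # An op is ready only if everything it depends on finished in an
            # EARLIER bundle, so ops within one bundle never depend on each other.
            if deps.get(op, set()) <= done:
                bundle.append(op)
        if not bundle:
            raise ValueError("cyclic or unsatisfiable dependences")
        bundles.append(bundle)
        done.update(bundle)
        remaining = [op for op in remaining if op not in done]
    return bundles


# Hypothetical straight-line code for: store(a * b + c)
OPS = ["load_a", "load_b", "mul", "load_c", "add", "store"]
DEPS = {
    "mul": {"load_a", "load_b"},
    "add": {"mul", "load_c"},
    "store": {"add"},
}

for cycle, bundle in enumerate(schedule_bundles(OPS, DEPS, width=2)):
    print(cycle, bundle)
```

With an issue width of 2, the six operations fit into four bundles, and the hardware needs no dynamic dependence checking because the schedule was fixed ahead of time.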
Second, we make more effective use of memory bandwidth. We go to user-controlled memory systems rather than caches. Caches are great, except when you have large amounts of data streaming through them. Then they're extremely inefficient; that's not what they were meant to do. They are meant to work when the program does repetitive things, but in a somewhat unpredictable fashion. Here we have repetitive things in a very predictable fashion, but very large amounts of data.
其次,我們更有效地利用內存帶寬。我們使用用戶控制的內存系統而不是緩存。緩存是好東西,但是如果要處理大量數據的話就不會那么好使了,效率極低,緩存不是用來干這事的。緩存旨在在程序執行具有重復性、可預測的操作時發揮作用。這里執行的運算雖然重復性高且可預測,但是數據量是在太大。
So we go to an alternative, using prefetching and other techniques to move data into the memory within the domain-specific processor. Once we get it into that memory, we can make heavy use of the data before moving it back to main memory.
那我們就用個別的方式。在我們把數據導入特定領域處理器上的內存之后,我們采用預提取和其他技術手段將數據導入內存中。接著,在我們需要把數據導去主存之前,我們就可以重度使用這些數據。
We eliminate unneeded accuracy. It turns out we need much less accuracy than we do for general-purpose computing. In the case of integers, we need 8- to 16-bit integers. In the case of floating point, we need 16- to 32-bit, not 64-bit, floating-point numbers. So we gain efficiency both by making data items smaller and by making the arithmetic operations more efficient.
我們消除了不需要的準確性。事實證明,我們需要的準確度比用于通用計算的準確度要低得多。我們只需要8-16位整數,要16到32位而不是64位的大規模浮點數。因此,我們通過使數據項變得更小而提高效率。
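The effect of narrower data types can be sketched with a simple symmetric 8-bit quantization. This is a generic illustration and not a description of any particular accelerator; the sample weights and the scaling scheme are assumptions made for the example.

```python
import struct

# Storage size in bytes of each number format (standard sizes).
BYTES_PER_VALUE = {
    "float64": struct.calcsize("<d"),
    "float32": struct.calcsize("<f"),
    "float16": struct.calcsize("<e"),
    "int8": struct.calcsize("<b"),
}


def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid a zero scale
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the 8-bit integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]  # hypothetical sample weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(BYTES_PER_VALUE)
print(q, scale, max_error)
```

Each value shrinks from 8 bytes to 1 byte, and the rounding error is bounded by half the scale step, which is the kind of accuracy-for-efficiency trade the talk describes.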
The key is that the domain-specific programming model matches the application to the processor. These are not general-purpose processors. You are not going to take a piece of C code, throw it on one of these processors, and be happy with the results. They're designed to match a particular class of applications, and that structure is determined by the interface in the domain-specific language and the underlying architecture.
關鍵在于特定領域的編程模型將應用程序與處理器匹配。這些不是通用處理器。你不會把一段 C 代碼扔到其中一個處理器上,然后對結果感到滿意。它們旨在匹配特定類別的應用程序,并且該結構由領域特定語言中的接口和架構決定。
So this just shows you an example, so you get an idea of how we're using silicon rather differently in these environments than we would in a traditional processor.
這里我們來看一個例子,以便了解這些處理器與常規處理器的不同之處。
What I've done here is take the first-generation TPU-1, the first tensor processing unit from Google, but I could take the second, third, or fourth; the numbers would be very similar. I show you a block diagram of what the chip area is devoted to. There's a very large matrix multiply unit that can do 256 x 256 8-bit multiplies, and the later ones actually have floating-point versions of that multiplier. It has a unified buffer used for local activations, a memory buffer, interfaces, accumulators, a little bit of control, and interfaces to DRAM.
這里展示是谷歌的第一代 TPU-1 ,當然我也可以采用第二、第三或第四代,但是它們帶來的結果是非常相似的。這些看起來像格子一樣的圖就是芯片各區域的分工。它有一個非常大的矩陣乘法單元,可以執行兩個 56 x 2 56 x 8 位乘法,后者實具有浮點版本乘法。它有一個統一的緩沖區,用于本地內存激活。還有接口、累加器、DRAM。
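A quick piece of arithmetic shows why a 256 x 256 array of 8-bit multipliers is paired with separate, wider accumulators. This sketch assumes one multiply-accumulate per array cell per cycle and signed 8-bit operands; both are assumptions for illustration, as are the names used.

```python
ROWS = COLS = 256               # matrix unit dimensions from the talk
INT8_MIN, INT8_MAX = -128, 127  # assumed signed 8-bit operand range

# Assumption: every cell performs one multiply-accumulate per cycle.
macs_per_cycle = ROWS * COLS

# Largest magnitude a single 8-bit x 8-bit product can reach.
worst_product = INT8_MIN * INT8_MIN

# Largest magnitude of a 256-element dot product of such values.
worst_dot = ROWS * worst_product


def signed_bits_needed(magnitude):
    """Bits required to hold +/- magnitude in two's complement."""
    return magnitude.bit_length() + 1


def dot_int8(xs, ys):
    """Dot product of 8-bit values, accumulated in a wide (Python) integer."""
    assert all(INT8_MIN <= v <= INT8_MAX for v in xs + ys)
    acc = 0
    for x, y in zip(xs, ys):
        acc += x * y
    return acc


print(macs_per_cycle, worst_product, worst_dot, signed_bits_needed(worst_dot))
```

Under these assumptions the array performs 65,536 multiply-accumulates per cycle, and a full-length dot product can reach 4,194,304, which does not fit in 16 bits; the sums therefore need accumulators wider than the 8-bit inputs.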
Today that would be high-bandwidth DRAM; early on it was DDR3. So look at the way in which the area is used: 44% is used for memory to store temporary results, weights, and things being computed. Almost 40% is used for compute, 15% for the interfaces, and 2% for control.
在今天我們使用的是高帶寬DRAM,以前可能用的是DDR3。那我們來具體看看這些區域的分工。 44% 用于內存以短時間內存儲運算結果。 40% 用于計算,15% 用于接口,2% 用于控件。
Compare that to a single Skylake core from an Intel processor. In that case, 33% is used for cache. So notice that we have more memory capacity in the TPU than we have on the Skylake core. In fact, if you were to remove the tags from the cache, because that's overhead and not real data, the difference would be even larger. The amount on the Skylake core would probably drop to about 30%, so the TPU has almost 50% more being used for active data.
將其與英特爾的 Skylake架構進行比較。在這種情況下,33% 用于緩存。請注意,我們在 TPU 中擁有比在Skylake 核心上更多的內存容量,事實上,如果移除緩存限制,這個數字甚至會更大。 Skylake 核心上的數量可能會下降到大約 30%,用于活動數據的數量也會增加近 50%。
30% of the area is used for control. That's because the Skylake core is an out-of-order, dynamically scheduled processor, like most modern general-purpose processors, and that requires significantly more area for control, roughly 15 times more. That control is overhead. Unfortunately, the control unit is also energy intensive, so it's a big power consumer. 21% is used for compute.
30% 的區域用于控制。這是因為與大多數現代通用處理器一樣,Skylake 核心是一個無序的動態調度處理器,需要更多的控制區域,大約是15 倍的區域。這種控制是額外負擔。不幸的是,控制單元是能源密集型計算,所以它也是一個能量消耗大戶。 21% 用于計算。
So notice that the big advantage here is that the compute area is almost double what it is in a Skylake core. Then there's memory management overhead and finally miscellaneous overhead. So the Skylake core is using a lot more for control, a lot less for compute, and somewhat less for memory.
這里存在的最大優勢是計算區域幾乎是 Skylake 核心的兩倍。內存管理有內存管理負擔,最后是雜項負擔。因此,控制占據了Skylake 核心的區域,意味著用于計算的區域更少了,內存也是同理。
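The area comparison can be checked with a few lines of arithmetic. The percentages below are the ones quoted in the talk ("almost 40%" is taken as 40); the Skylake memory-management and miscellaneous shares are not quantified in the talk and are therefore left out. The dictionary and function names are hypothetical.

```python
# Share of chip area (percent) as quoted in the talk.
TPU_AREA = {"memory": 44, "compute": 40, "interfaces": 15, "control": 2}
SKYLAKE_AREA = {"memory": 33, "control": 30, "compute": 21}  # remainder not quantified


def area_ratio(category, numerator, denominator):
    """How many times more area `numerator` devotes to `category` than `denominator`."""
    return numerator[category] / denominator[category]


compute_ratio = area_ratio("compute", TPU_AREA, SKYLAKE_AREA)   # TPU vs Skylake
memory_ratio = area_ratio("memory", TPU_AREA, SKYLAKE_AREA)     # TPU vs Skylake
control_ratio = area_ratio("control", SKYLAKE_AREA, TPU_AREA)   # Skylake vs TPU

print(f"compute: TPU has {compute_ratio:.2f}x the area share")
print(f"memory:  TPU has {memory_ratio:.2f}x the area share")
print(f"control: Skylake has {control_ratio:.0f}x the area share")
```

The ratios line up with the spoken claims: compute is "almost double" (about 1.9x), memory is somewhat larger (about 1.3x), and control is roughly 15 times larger on the general-purpose core.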
So where does this bring us? We've come to an interesting time in the computing industry, and I just want to conclude by reflecting on this and saying something about how things are likely to go forward in the future, because I think we're at a real turning point in the history of computing.
那么我們現在處于一個什么位置呢?我們來到了計算行業的一個有趣時期。我想通過分享一些我的個人思考、以及對未來的一些展望結束這場講演,因為我認為我們正處在計算領域歷史的一個轉折點。
From the 1960s, with the introduction of the first real commercial computers, to 1980, we had largely vertically integrated companies.
從 1960 年代第一臺真正的商用計算機的出現到 1980 年,市面上的計算機公司基本上都是垂直整合的。
IBM, Burroughs, Honeywell, and the early spin-outs from the activity at the University of Pennsylvania that built ENIAC, the first electronic computer.
IBM、寶來公司、霍尼韋爾、以及其他參與了賓夕法尼亞大學制造的世界上第一臺電子計算機 ENIAC 公司都是垂直整合的公司。
IBM is the perfect example of a vertically integrated company in that period. They did everything: they built their own chips, they built their own disks. In fact, the West Coast operation of IBM here in California was originally opened to do disk technology, and the first Winchester disks were built on the West Coast.
IBM 是那個時期垂直整合公司的完美典范。IBM好像無所不能,他們圍繞著芯片制造,他們制造了光盤。事實上,IBM 在加利福尼亞的西海岸業務最初就是光盤技術,而第一個溫徹斯特光盤就是在西海岸制造出來的。
They built their own processors: the 360, the 370 series, and so on. On top of that they built their own operating systems and their own compilers. They even built their own databases. They built their networking software. In some cases they even built application programs, but certainly the core of the system, from the fundamental hardware up through the databases, OS, and compilers, was all built by IBM. And the driver here was technical concentration. IBM could put together the expertise across this wide set of things, assemble a world-class team, and really optimize across the stack in a way that enabled their operating system to do things such as virtual memory long before other commercial vendors could do that.
他們還構建了自己的處理器,有360、370系列等等。之后他們開發了自己的操作系統、編譯器。他們甚至建立了自己的數據庫、自己的網絡軟件。他們甚至開發了應用程序。可以肯定的是,從基礎硬件到數據庫、操作系統、編譯器等系統核心都是由 IBM 自己構建的。而這里的驅動力是技術的集中。 IBM 可以將這些廣泛領域的專業知識整合在一起、組建一個世界一流的團隊、并從而優化整個堆棧,使他們的操作系統能夠做到虛擬內存這種事,這可要比在其他公司要早得多。
And then the world changed, really changed, with the introduction of the personal computer and the takeoff of the microprocessor.
接著出現了重大變化——個人電腦的推出和微處理器的崛起。
Then we changed from a vertically organized industry to a horizontally organized industry. We had silicon manufacturers: Intel, for example, doing processors, along with AMD and initially several other companies such as Fairchild and Motorola. We had a company like TSMC arise as a silicon foundry, making silicon for others. That was something that didn't exist earlier, but in the late 80s and 90s it really began to take off, and that enabled other people to build chips for graphics or other functions outside the processor.
接著這個行業從垂直轉變為水平縱向的。我們有專精于做半導體的公司,例如英特爾和 AMD ,最初還有其他幾家公司例如仙童半導體和摩托羅拉。臺積電也通過代工崛起。這些在早期都是見不到的,但在 80 年代末和 90 年代開始逐漸起步,讓我們能夠做其它類型的處理器,例如圖形處理器等。
But Intel didn't do everything. Intel did the processors, and Microsoft then came along and did the OS and compilers on top of that. And companies like Oracle came along and built their databases and other applications on top of that. So it became a very horizontally organized industry. The key driver behind this was obviously the introduction of the personal computer.
但是英特爾并沒有一家公司包攬所有業務。英特爾專做處理器,然后微軟出現了,微軟做操作系統和編譯器。甲骨文等公司隨之出現,并在此基礎上構建他們的應用程序數據庫和其他應用程序。這個行業就變成了一個縱向發展等行業。這背后的關鍵驅動因素,顯然是個人電腦的出現。
Another was the rise of shrink-wrap software, something a lot of us did not foresee coming but which really became a crucial driver. It meant that the number of architectures that you could easily support had to be kept fairly small, because the software companies doing shrink-wrap software did not want to have to port and verify that their software worked on lots of different architectures.
軟件實體銷售等興起也是我們很多人沒有預料到的,但它確實成為了一個關鍵的驅動因素,這意味著必須要限制可支持的架構數量,因為軟件公司不想因為架構數量太多而需要進行大量的移植和驗證工作。
And of course there was the dramatic growth of the general-purpose microprocessor. This is the period in which the microprocessor replaced all other technologies, including the largest supercomputers. And I think it happened much faster than we expected. By the mid 80s, the microprocessor had put a serious dent in the minicomputer business, which was struggling by the early 90s. Then it hit the mainframe business, and by the mid 90s to 2000s it was really taking a bite out of the supercomputer industry. So even the supercomputer industry converted from customized special architectures to arrays of these general-purpose microprocessors. They were just far too efficient in terms of cost and performance to be ignored.
當然還有通用微處理器的快速增長。這是微處理器取代所有其他技術的時期,包括最大的超級計算機。我認為它發生的速度比我們預期的要快得多,因為 80 年代中期,微處理器對微型計算機業務造成了一系列影響。到 90 年代初主要業務陷入困境,而到 90 年代中期到 2000 年代,它確實奪走了超級計算機行業的一些市場份額。因此,即使是超級計算機行業,也從定制的特殊架構轉變為一系列的通用微處理器,它們在成本和性能方面的效率實在是太高了,不容忽視。
Now we're all of a sudden in a new era, not because general-purpose processors are going to go away completely. They are going to remain important, but they'll be less central to the drive to the edge, to the very fastest, most important applications, where the domain-specific processor will begin to play a key role. So rather than so much a horizontal organization, we will again see more vertical integration between the people who have the models for deep learning and machine learning systems and the people who build the OS and compilers that enable those to run efficiently, train efficiently, and be deployed in the field.
現在我們突然進入了一個新時代。這并不意味著通用處理器會完全消失,它們仍然很重要,但它們將不是驅動行業發展的主力,能夠與軟件快速聯動的特定領域處理器將會逐漸發揮重大作用。因此,我們接下來或許會看到一個更垂直的行業,會看到擁有深度學習和機器學習模型的開發者,與操作系統和編譯器的開發者之間更垂直的整合,使他們的程序能夠有效運行、有效地訓練以及進入實際使用。
Inference is a critical part. It means that when we deploy these in the field, we will probably have lots of very specialized processors that each do one particular problem. The processor that sits in a camera, for example a security camera, is going to have a very limited use. The key is going to be optimizing for power and efficiency in that key use, and cost of course. So we see a different kind of integration, and Microsoft, Google, and Apple are all looking at this.
程序推理是一個關鍵部分,這意味著當我們進行部署時,可能會有很多非常專業的處理器來處理一個特定的問題。例如,位于攝像頭中的處理器用途就非常有限。當然,關鍵是優化功耗和成本。所以我們看到了一種不同的整合方案。微軟、谷歌和蘋果都在關注這個領域。
The Apple M1 is a perfect example. If you look at the Apple M1, it's a processor designed by Apple with a deep understanding of the applications that are likely to run on that processor. So they have a special-purpose graphics processor, they have a special-purpose machine learning accelerator on there, and then they have multiple cores, but even the cores are not completely homogeneous. Some are slow, low-power cores, and some are high-speed, high-performance, higher-power cores. So we see a completely different design approach with lots more codesign and vertical integration.
例如Apple M1,Apple M1 就是一個完美的例子,它是由 蘋果設計的處理器,對蘋果電腦上可能運行的程序有著極好的優化。他們有一個專用的圖形處理器、專用的機器學習領域加速器、有多個核心。即使是處理器核心也不是完全同質的,有些是功耗低的、比較慢的核心,有些是高性能高功耗的核心。我們看到了一種完全不同的設計方法,有更多的協同設計和垂直整合。
We're optimizing in a different way than we had in the past, and I think this is going to slowly but surely change the entire computer industry. It's not that the general-purpose processor will go away, and it's not that the companies that make software that runs on multiple machines will completely go away, but we'll have a whole new driver, and that driver is created by the dramatic breakthroughs that we've seen in deep learning and machine learning. I think this is going to make for a really interesting next 20 years.
我們正在以與過去不同的方式進行優化,這會是一個緩慢的過程,但肯定會改變整個計算機行業。我不是說通用處理器會消失,也不是說做多平臺軟件的公司將消失。我想說的是,這個行業會有全新的驅動力,由我們在深度學習和機器學習中看到的巨大突破創造的驅動力。我認為這將使未來 20 年變得非常有趣。
Thank you for your kind attention, and I'd like to wish the 2021 T-EDGE conference a great success. Thank you.
最后,你耐心地聽完我這次演講。我也預祝 2021 年 T-EDGE 會議取得圓滿成功,謝謝。