
Meet Mako: unlock peak GPU performance and reduce AI inference costs automatically

Co-founder Waleed Atallah’s team is building AI-native infrastructure that automates AI kernel generation and tuning, helping developers deploy models faster, with better price-performance across NVIDIA, AMD, and custom accelerators.

By Christine Hall

August 12, 2025 | 9 min

Investment

We’re excited to welcome Mako, the AI infrastructure company helping developers unlock GPU performance, no rewrites required. Co-founded by Waleed Atallah, Mohamed Abdelfattah, and Lukasz Dudziak, Mako aims to become the hardware-agnostic AI performance layer: the middleware that sits between any model and any hardware to write, optimize, and deploy high-performance GPU code with breakthrough efficiency and scale. M13 led an $8.5 million-plus seed round into Mako with participation from Neo, Flybridge, and a group of angel investors, including AI pioneer and Google DeepMind chief scientist Jeff Dean.

Why we’re excited about Mako

For quite some time, Nvidia has locked up the AI compute stack, mainly through control of the default programming interface for GPU workloads, Compute Unified Device Architecture (CUDA). 

What M13 partner Karl Alomar recognized is that, over time, this lock-in is not sustainable.

The developer ecosystem yearns for more flexible layers of abstraction. Just as Kubernetes abstracted away the complexity of running applications in cloud environments, Mako is doing the same for GPUs. 

Their AI‑driven engine automatically generates device‑specific kernels and tunes GPU kernel code to fit any hardware—NVIDIA, AMD, or custom accelerators—without sacrificing performance. This layer of abstraction lets developers write once, deploy anywhere. No more kernel rewrites. No more hand-tuning. Just smarter, faster infrastructure optimized by AI itself.
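
The core mechanic underneath this is measurement-driven selection: run functionally equivalent implementations on the target device and keep the fastest. Here is a minimal sketch of that premise in PyTorch; it is not Mako's code, and the candidate variants and timing harness are illustrative assumptions.

```python
import torch

def benchmark(fn, *args, warmup=3, iters=20):
    """Median CUDA wall-clock time of fn(*args), in milliseconds."""
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    times = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn(*args)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))
    return sorted(times)[len(times) // 2]

# Three functionally identical ways to compute A @ B. On a given GPU and
# shape, one variant can beat the others -- the premise every kernel
# auto-tuner exploits, just over vastly larger search spaces.
candidates = {
    "plain":   lambda a, b: a @ b,
    "chunk_2": lambda a, b: torch.cat([c @ b for c in a.chunk(2)]),
    "chunk_4": lambda a, b: torch.cat([c @ b for c in a.chunk(4)]),
}

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
timings = {name: benchmark(fn, a, b) for name, fn in candidates.items()}
best = min(timings, key=timings.get)
print(f"fastest variant on this device: {best} ({timings[best]:.2f} ms)")
```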

Now enter China-based DeepSeek. The startup made headlines earlier this year for disrupting the AI industry — and ChatGPT — with its approach to cost and performance. What made DeepSeek special was not better hardware but its handcrafted GPU code. Their close hardware-software co-design delivered GPU kernel-level optimization that unlocked extreme performance improvements (think millions of dollars saved) and showed the world that compute economics can be transformed without hardware lock-in.

“DeepSeek is demonstrating that price performance is important and not just about spending yourself to oblivion,” Alomar said. “It also strengthens the argument that there is a better way to put this hardware into the world.”

This is where Mako comes in. Mako offers scalable, hardware-agnostic, continuous AI-driven optimization. The company’s AI-assisted kernel generator and auto-tuner automate the optimization of AI models, a process that is otherwise expensive and manual, turning a niche skill into accessible infrastructure and democratizing performance optimization across the AI stack.

“In a world dominated and monopolized by one brand, it’s such a clear opportunity to give developers, and the market as a whole, the ability to abstract away from hardware and build a high-performing outcome. If you're going to back a team to build something this technically advanced, this is the team to do it with.” - M13 partner Karl Alomar

Running artificial intelligence models can be a challenge for developers, who are expected to design around hardware and software limitations. Mako wants to make this process faster and more efficient with its AI-powered automation of graphics processing unit (GPU) kernel optimization across all variants of GPU chips.

Co-founder and CEO Waleed Atallah explains that Mako works with any PyTorch or Hugging Face model and auto-tunes GPU kernels and inference engines to optimize performance. It also utilizes a search-based optimization engine that is continuously learning and improving to provide that speed and efficiency.
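
The article doesn't show Mako's interface, so the snippet below is a purely hypothetical sketch of what the described workflow (hand over a PyTorch or Hugging Face model, get back a kernel-tuned drop-in replacement) might look like. The `mako` module, the `optimize` function, and every parameter name are invented for illustration; they are not Mako's published API.

```python
# Hypothetical sketch only: the `mako` package and `optimize()` signature
# below are invented for illustration, not Mako's actual API.
from transformers import AutoModelForCausalLM
import mako  # hypothetical package name

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# The optimizer would profile the model, search for faster kernels on the
# target device, and return a drop-in replacement with the same interface.
optimized = mako.optimize(
    model,
    target="amd-mi300x",         # or "nvidia-h100", a custom accelerator...
    objective="cost_per_token",  # what the search should minimize
)
optimized.save_pretrained("mistral-7b-mi300x-tuned")
```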

The current software standard, Compute Unified Device Architecture (CUDA), was created by Nvidia in 2006. It “gives direct access to the GPU's virtual instruction set and parallel computational elements for the execution of compute kernels.”
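
For readers new to the term, a compute kernel is the small program that actually executes on the GPU's thousands of parallel cores. Below is a real, minimal example of one, written in Triton (a Python-embedded GPU kernel language) rather than raw CUDA C++ so this post's snippets stay in one language; it adds two vectors elementwise.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Many instances of this function run in parallel on the GPU; each
    # handles one BLOCK_SIZE-wide slice of the input vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the final, possibly partial block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)  # BLOCK_SIZE is tunable
    return out
```

The hard-coded BLOCK_SIZE is exactly the kind of knob kernel tuning turns: the best value differs by GPU, and real kernels expose many such knobs at once.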

Companies of all sizes can deploy easily with CUDA; Mako’s AI-native compiler and software stack, however, lets anyone optimize price-performance and get far more out of their infrastructure at lower cost.

“Every major company in the world that’s deploying AI at scale cares a lot about the performance that they get, and the performance is a direct result of how well the GPU kernels are selected and compiled,” Atallah said. 

As such, these kernel engineers are handsomely paid (Atallah referenced “millions” paid to write the kernels), “because even getting a 2% improvement in utilization can yield millions and millions of dollars in savings,” he said.
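
A back-of-the-envelope calculation shows why even two points of utilization matter at fleet scale. The spend and utilization figures below are assumptions chosen for illustration, not numbers from Atallah or Mako.

```python
# Hypothetical numbers: what a 2-point utilization gain is worth at scale.
annual_gpu_spend = 500_000_000   # assumed: $500M/year GPU fleet spend
baseline_util = 0.40             # assumed: 40% of paid-for compute is used
improved_util = 0.42             # +2 points from better kernel selection

# Delivered compute per dollar scales with utilization, so the same
# workload now needs proportionally less spend.
equivalent_spend = annual_gpu_spend * baseline_util / improved_util
savings = annual_gpu_spend - equivalent_spend
print(f"annual savings: ${savings:,.0f}")  # ~$23.8M per year
```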

“It's one of the highest return on investment activities you can do if you're a large-scale AI company,” Atallah said. “To automate this process, which typically is done manually, is almost like being an artisan. It’s a really rare niche technology skill that is incredibly valuable and enables other people to get that high level of performance that is usually reserved for those that can afford to court the very expensive performance engineer.”

To prove that point, Mako publishes benchmarks showing the cost savings and efficiency gained by automating the kernel-writing process. For example, its optimized containers achieve roughly a 49% performance improvement and a 70% cost reduction with smaller models like Mistral-7B, and an 85% performance improvement with a 44% cost reduction with Qwen2-72B.
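
Performance and cost improvements measure different things: cost per token is hourly hardware price divided by throughput, so a throughput gain lowers cost on its own, and pairing tuned kernels with cheaper hardware compounds the effect. A quick illustration, using assumed prices and throughputs rather than Mako's benchmark data:

```python
# Illustrative only: the price and throughput are assumptions, not
# figures from Mako's benchmarks.
def dollars_per_million_tokens(price_per_hour, tokens_per_second):
    return price_per_hour / (tokens_per_second * 3600) * 1_000_000

baseline = dollars_per_million_tokens(4.00, 1500)          # untuned
tuned    = dollars_per_million_tokens(4.00, 1500 * 1.49)   # +49% throughput
print(f"baseline ${baseline:.2f}/M tokens -> tuned ${tuned:.2f}/M tokens")
# A 49% throughput gain alone cuts per-token cost by ~33% (1 - 1/1.49);
# reductions like 70% also require moving the workload to cheaper hardware.
```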

The origin story: why Mako exists

Atallah and Abdelfattah met while working at Intel about eight years ago, Atallah on the product planning side and Abdelfattah on the engineering side of the same AI chip. Atallah’s background is in semiconductors, and the two were working on a type of chip called a field-programmable gate array, or FPGA, building a deep-learning accelerator for AI inference.

That work got Atallah excited about the future of AI. He left Intel in 2021 and joined Untether AI, a startup building a processor designed specifically for AI workloads, where he was a silicon product manager.

There, he saw the same problem Intel was facing with running AI models: someone needed to write the kernels, and it could never be done fast enough. He also identified two trends: first, there would be multiple hardware options, not just from Nvidia, Intel, and AMD; second, a host of other startups and established companies (Amazon Web Services, Azure, Google, Meta, Apple, and more) were creating chips.

“I figured if a company as big as Intel faces this, and as new as Untether faces this, it's probably a pretty universal thing,” Atallah said. “I saw there would be an opportunity to essentially create an automated code generator, the likes of which literally could not have existed before.”

He went on to explain that one of the major bottlenecks for AI was the graphics processing unit (GPU) kernel, a small sub-program that runs on the chip. Atallah started digging around for ways to automatically generate GPU kernels instead of writing everything by hand, which he described as a “painstaking and manual process.”

This was also a problem Alomar said he saw during his earlier days as DigitalOcean’s chief operating officer. Working in cloud infrastructure, he watched layer after layer of technology get abstracted away from the hardware, and he understood that price-performance is not just price or just performance but a combination of the two.

“When I talked to Waleed, I recognized a lot of what was really obvious to me when I was running DigitalOcean around how developers want to be abstracted away from the hardware and given freedom of where they want to work,” Alomar said.

During his research, Atallah reached out to Abdelfattah, by then a professor at Cornell focused on compiler research. They began discussing ways to retarget Abdelfattah’s research toward solving the problem of writing GPU kernels by generating them automatically.

“We did a few experiments, and it seemed to work out pretty well,” Atallah said. “We put together a proposal and took it to Untether AI’s executive leadership. Unfortunately they turned it down.”

Atallah thought it was too big of an opportunity to pass up, so he quit his job in January 2024 to work on the project full time with Abdelfattah.

AI-native compilers: the next infrastructure unlock

Mako began as A2 Labs. To accelerate technical execution, the founders brought on Lukasz Dudziak, a former Intel engineer and prior collaborator of Abdelfattah’s at Samsung, as co-founder and CTO.

Mako is building a self-optimizing AI-native compiler, something the infrastructure world has never seen at scale. Mako’s unique approach fuses traditional compilation with deep learning-based search and LLM-driven code generation.

“Less sexy but just as important is that we use a deep learning-based search, so we have an AI that helps conduct the search, and it nudges the LLM to generate code in certain ways,” Atallah said. This self-improving system replaces the manual artistry of kernel engineering with automated precision.
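
The article describes this only at a high level, but the generate-and-search pattern it names (an LLM proposes kernel code, a search signal steers the next proposal) can be sketched as a loop. Every function below is a placeholder written for illustration; none of it is Mako's implementation, and a real system would use a learned model rather than a string hint to direct the search.

```python
# Placeholder skeleton of an LLM-in-the-loop kernel search; not Mako's code.
def search_kernel(llm, op_spec, compile_fn, benchmark_fn, rounds=10):
    """llm: prompt -> source code. compile_fn: source -> kernel or None.
    benchmark_fn: kernel -> latency in ms (after a correctness check)."""
    best_src, best_ms = None, float("inf")
    hint = "no candidates yet"
    for _ in range(rounds):
        src = llm(f"Write a GPU kernel for: {op_spec}\nGuidance: {hint}")
        kernel = compile_fn(src)
        if kernel is None:               # discard code that won't compile
            hint = "previous candidate failed to compile"
            continue
        ms = benchmark_fn(kernel)        # measure on the target device
        if ms < best_ms:
            best_src, best_ms = src, ms
        # A deep learning-based search would choose the next direction;
        # here the measured time is fed back as a crude steering signal.
        hint = f"best so far {best_ms:.3f} ms; try a different tiling"
    return best_src, best_ms
```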

This approach sets Mako apart from prior efforts in the space, such as OctoML, which raised over $100 million to optimize model deployment and was acquired by Nvidia for $250 million in 2024.

The enabling technology for an AI-first compiler platform only now exists, so starting Mako wasn’t possible two years ago. “Teams that tried to do this told me that they were just too early,” Atallah said. “It was only with the advent of a lot of this new technology, and the market trending in a certain direction in terms of available hardware options, that it was possible to build the kind of infrastructure layer that we’re building.”

Taking GPUs into the future: software-defined performance at scale

Mako sees a future where it won’t matter what GPU you are running on or which libraries you are using. Its platform uses AI to automatically select and optimize the best combination of kernel, library, and hardware for each workload.

In a world where AI chips are dominated by essentially one brand, Alomar said there is an opportunity to give developers and the market “the ability to abstract away hardware lock-in and build a high-performing outcome. Mako is building exactly the right way.”

With every major software paradigm shift, a new type of software infrastructure is needed. As trillions of dollars pour into GPU development, enterprises, hyperscalers, and governments are investing in differentiated chips and will need intelligent abstraction layers to deploy AI workloads seamlessly across them.

“This type of technology is going to be indispensable and become a standard,” Atallah said. “If we can enable people to build in different directions, it opens the door for the future of AI research and application.” 

Mako’s vision is clear: to become the de facto performance layer for global AI compute, installed in every major data center in the world.

Read more about Waleed Atallah

These 11 startups are making AI more energy and cost-efficient, according to top VCs

Revolutionizing AI Efficiency: Mako’s Journey with Waleed Atallah

Building chips for AI with Waleed Atallah from Untether AI

https://www.linkedin.com/in/waleedatallah/

Follow Mako 

https://mako-dev.com/blog


The views expressed here are those of the individual M13 personnel quoted and are not the views of M13 Holdings Company, LLC (“M13”) or its affiliates. This content is for general informational purposes only and does not and is not intended to constitute legal, business, investment, tax or other advice. You should consult your own advisers as to those matters and should not act or refrain from acting on the basis of this content. This content is not directed to any investors or potential investors, is not an offer or solicitation and may not be used or relied upon in connection with any offer or solicitation with respect to any current or future M13 investment partnership. Past performance is not indicative of future results. Unless otherwise noted, this content is intended to be current only as of the date indicated. Any projections, estimates, forecasts, targets, prospects, and/or opinions expressed in these materials are subject to change without notice and may differ or be contrary to opinions expressed by others. Any investments or portfolio companies mentioned, referred to, or described are not representative of all investments in funds managed by M13, and there can be no assurance that the investments will be profitable or that other investments made in the future will have similar characteristics or results. A list of investments made by funds managed by M13 is available at m13.co/portfolio.