Technology

Did xAI lie about Grok 3’s benchmarks?

Debates over AI benchmarks — and how they’re reported by AI labs — are spilling out into public view.

This week, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of the co-founders of xAI, Igor Babushkin, insisted that the company was in the right.

The truth lies somewhere in between.

In a post on xAI’s blog, the company published a graph showing Grok 3’s performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME’s validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model’s math ability.

xAI’s graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI’s best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI’s graph didn’t include o3-mini-high’s AIME 2025 score at “cons@64.”

What is cons@64, you might ask? Well, it’s short for “consensus@64,” and it gives a model 64 tries to answer each problem in a benchmark, then takes the answer it generated most frequently for each problem as its final answer. As you can imagine, cons@64 tends to boost models’ benchmark scores quite a bit, and omitting it from a graph might make it appear as though one model surpasses another when in reality that isn’t the case.
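
To make the difference concrete, here is a minimal Python sketch of how “@1” and cons@64 scoring diverge. It assumes a made-up toy model that answers each problem correctly 40% of the time and otherwise scatters its guesses across ten wrong answers. The function names and parameters are hypothetical, not taken from any lab’s evaluation harness, and a real grader would also normalize numeric answers before counting votes.

```python
import random
from collections import Counter


def sample_answers(rng: random.Random, truth: str, k: int, p_correct: float,
                   n_distractors: int = 10) -> list[str]:
    """Hypothetical toy model: each of k tries returns the correct answer with
    probability p_correct, otherwise one of n_distractors wrong answers."""
    answers = []
    for _ in range(k):
        if rng.random() < p_correct:
            answers.append(truth)
        else:
            answers.append(f"wrong-{rng.randrange(n_distractors)}")
    return answers


def majority_vote(answers: list[str]) -> str:
    """Consensus answer: the most frequently generated answer."""
    return Counter(answers).most_common(1)[0][0]


def evaluate(n_problems: int = 200, k: int = 64, p_correct: float = 0.4,
             seed: int = 0) -> tuple[float, float]:
    """Return (@1 accuracy, cons@k accuracy) over n_problems toy problems."""
    rng = random.Random(seed)
    first_try_correct = 0
    consensus_correct = 0
    for i in range(n_problems):
        truth = f"answer-{i}"
        answers = sample_answers(rng, truth, k, p_correct)
        # @1: grade only the first attempt
        first_try_correct += answers[0] == truth
        # cons@k: grade the majority answer across all k attempts
        consensus_correct += majority_vote(answers) == truth
    return first_try_correct / n_problems, consensus_correct / n_problems


at_1, cons_64 = evaluate()
print(f"@1: {at_1:.2f}   cons@64: {cons_64:.2f}")
```

Under these invented parameters the @1 score should land near the 40% hit rate while cons@64 lands near 100%, because the right answer only has to be the most common one, not the first one. Voting cannot rescue a problem the model consistently gets wrong, and it costs 64 times as many model calls per problem.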

Grok 3 Reasoning Beta and Grok 3 mini Reasoning’s scores for AIME 2025 at “@1,” meaning the score from a model’s first attempt at each problem, fall below o3-mini-high’s score. Grok 3 Reasoning Beta also trails ever-so-slightly behind OpenAI’s o1 model set to “medium” compute. Yet xAI is advertising Grok 3 as the “world’s smartest AI.”

Babushkin argued on X that OpenAI has published similarly misleading benchmark charts in the past — albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more “accurate” graph showing nearly every model’s performance at cons@64:

Hilarious how some people see my plot as attack on OpenAI and others as attack on Grok while in reality it’s DeepSeek propaganda
(I actually believe Grok looks good there, and openAI’s TTC chicanery behind o3-mini-*high*-pass@"""1""" deserves more scrutiny.) https://t.co/dJqlJpcJh8 pic.twitter.com/3WH8FOUfic

— Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxesTex) February 20, 2025

But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models’ limitations — and their strengths.

Intellectual Insider

A blog focusing on business, net worth, technology, entrepreneurship, self-improvement, celebrities, top lists, travel, health, and lifestyle. It aims to bring you top information from around the world across a wide range of topics.

Meta, X approved ads containing violent anti-Muslim, antisemitic hate speech ahead of German election, study finds

Social media giants Meta and X approved ads targeting users in Germany with violent anti-Muslim and antisemitic hate speech in the run-up to the country’s federal elections, according to new research from Eko, a corporate responsibility nonprofit campaign group.

The group’s researchers tested whether the two platforms’ ad review systems would approve or reject submissions for ads containing hateful and violent messaging targeting minorities ahead of an election where immigration has taken center stage in mainstream political discourse — including ads containing anti-Muslim slurs; calls for immigrants to be imprisoned in concentration camps or to be gassed; and AI-generated imagery of mosques and synagogues being burnt.

Most of the test ads were approved within hours of being submitted for review in mid-February. Germany’s federal elections are set to take place on Sunday, February 23.

Hate speech ads scheduled

Eko said X approved all 10 of the hate speech ads its researchers submitted just days before the federal election is due to take place, while Meta approved half (five ads) for running on Facebook (and potentially also Instagram) — though it rejected the other five.

The reason Meta provided for the five rejections indicated the platform believed there could be risks of political or social sensitivity which might influence voting.

However, the five ads that Meta approved included violent hate speech likening Muslim refugees to a “virus,” “vermin,” or “rodents,” branding Muslim immigrants as “rapists,” and calling for them to be sterilized, burnt, or gassed. Meta also approved an ad calling for synagogues to be torched to “stop the globalist Jewish rat agenda.”

As a side note, Eko says none of the AI-generated imagery it used to illustrate the hate speech ads was labeled as artificially generated, yet half of the 10 ads were still approved by Meta, despite the company having a policy that requires disclosure of the use of AI imagery in ads about social issues, elections, or politics.

X, meanwhile, approved all five of these hateful ads — and a further five that contained similarly violent hate speech targeting Muslims and Jews.

These additional approved ads included messaging attacking “rodent” immigrants that the ad copy claimed are “flooding” the country “to steal our democracy,” and an antisemitic slur which suggested that Jews are lying about climate change in order to destroy European industry and accrue economic power.

The latter ad was combined with AI-generated imagery depicting a group of shadowy men sitting around a table surrounded by stacks of gold bars, with a Star of David on the wall above them — with the visuals also leaning heavily into antisemitic tropes.

Another ad X approved contained a direct attack on the SPD, the center-left party that currently leads Germany’s coalition government, with a bogus claim that the party wants to take in 60 million Muslim refugees from the Middle East, before going on to try to whip up a violent response. X also duly scheduled an ad suggesting “leftists” want “open borders” and calling for the extermination of Muslim “rapists.”

Elon Musk, the owner of X, has used the platform, where he has close to 220 million followers, to personally intervene in the German election. In a tweet in December, he called for German voters to back the far-right AfD party to “save Germany.” He has also hosted a livestream on X with the AfD’s leader, Alice Weidel.

Eko’s researchers disabled all test ads before any that had been approved were scheduled to run to ensure no users of the platform were exposed to the violent hate speech.

Eko says the tests highlight glaring flaws in the ad platforms’ approach to content moderation. Indeed, in the case of X, it’s not clear whether the platform is doing any moderation of ads, given that all 10 violent hate speech ads were quickly approved for display.

The findings also suggest that the ad platforms could be earning revenue as a result of distributing violent hate speech.

EU’s Digital Services Act in the frame

Eko’s tests suggest that neither platform is properly enforcing the bans on hate speech that both claim to apply to ad content in their own policies. Furthermore, in the case of Meta, Eko reached the same conclusion after conducting a similar test in 2023, ahead of new EU online governance rules coming into force, suggesting the regime has had no effect on how Meta operates.

“Our findings suggest that Meta’s AI-driven ad moderation systems remain fundamentally broken, despite the Digital Services Act (DSA) now being in full effect,” an Eko spokesperson told TechCrunch.

“Rather than strengthening its ad review process or hate speech policies, Meta appears to be backtracking across the board,” they added, pointing to the company’s recent announcement about rolling back moderation and fact-checking policies as a sign of “active regression” that they suggested puts it on a direct collision course with DSA rules on systemic risks.

Eko has submitted its latest findings to the European Commission, which oversees enforcement of key aspects of the DSA on the pair of social media giants. It also said it shared the results with both companies, but neither responded.

The EU has open DSA investigations into Meta and X, which include concerns about election security and illegal content, but the Commission has yet to conclude these proceedings. Back in April, however, it said it suspected Meta of inadequate moderation of political ads.

A preliminary decision on a portion of its DSA investigation on X, which was announced in July, included suspicions that the platform is failing to live up to the regulation’s ad transparency rules. However, the full investigation, which kicked off in December 2023, also concerns illegal content risks, and the EU has yet to arrive at any findings on the bulk of the probe well over a year later.

Confirmed breaches of the DSA can attract penalties of up to 6% of global annual turnover, while systemic non-compliance could even lead to regional access to violating platforms being blocked temporarily.

But, for now, the EU is still taking its time to make up its mind on the Meta and X probes so — pending final decisions — any DSA sanctions remain up in the air.

Meanwhile, it’s now just a matter of hours before German voters go to the polls, and a growing body of civil society research suggests that the EU’s flagship online governance regulation has failed to shield the democratic process in one of the bloc’s major economies from a range of tech-fueled threats.

Earlier this week, Global Witness released the results of tests of X and TikTok’s algorithmic “For You” feeds in Germany, which suggest the platforms are biased in favor of promoting AfD content versus content from other political parties. Civil society researchers have also accused X of blocking data access to prevent them from studying election security risks in the run-up to the German poll — access the DSA is supposed to enable.

“The European Commission has taken important steps by opening DSA investigations into both Meta and X, now we need to see the Commission take strong action to address the concerns raised as part of these investigations,” Eko’s spokesperson also told us.

“Our findings, alongside mounting evidence from other civil society groups, show that Big Tech will not clean up its platforms voluntarily. Meta and X continue to allow illegal hate speech, incitement to violence, and election disinformation to spread at scale, despite their legal obligations under the DSA,” the spokesperson added. (We have withheld the spokesperson’s name to prevent harassment.)

“Regulators must take strong action — both in enforcing the DSA but also for example implementing pre-election mitigation measures. This could include turning off profiling-based recommender systems immediately before elections, and implementing other appropriate ‘break-glass’ measures to prevent algorithmic amplification of borderline content, such as hateful content in the run-up [to] elections.”

The campaign group also warns that the EU is now facing pressure from the Trump administration to soften its approach to regulating Big Tech. “In the current political climate, there’s a real danger that the Commission doesn’t fully enforce these new laws as a concession to the U.S.,” they suggest.

Nvidia CEO Jensen Huang says market got it wrong about DeepSeek’s impact

Nvidia founder and CEO Jensen Huang said the market got it wrong when it comes to DeepSeek’s technological advancements and its potential to negatively impact the chipmaker’s business.

Instead, Huang called DeepSeek’s R1 open source reasoning model “incredibly exciting” while speaking with Alex Bouzari, CEO of DataDirect Networks, in a pre-recorded interview that was released on Thursday.

“I think the market responded to R1, as in, ‘Oh my gosh. AI is finished,’” Huang told Bouzari. “You know, it dropped out of the sky. We don’t need to do any computing anymore. It’s exactly the opposite. It’s [the] complete opposite.”

Huang said the release of R1 is inherently good for the AI market and will accelerate AI adoption, rather than signaling that the market no longer needs compute resources like the ones Nvidia produces.

“It’s making everybody take notice that, okay, there are opportunities to have the models be far more efficient than what we thought was possible,” Huang said. “And so it’s expanding, and it’s accelerating the adoption of AI.”

He also pointed out that, despite the advancements DeepSeek made in pre-training AI models, post-training will remain important and resource-intensive.

“Reasoning is a fairly compute-intensive part of it,” Huang added.

Nvidia declined to provide further commentary.

Huang’s comments come almost a month after DeepSeek released the open source version of its R1 model, which rocked the AI market in general and seemed to hit Nvidia disproportionately hard. The company’s stock price plummeted 16.9% in a single trading day after DeepSeek’s news broke.

Nvidia’s stock closed at $142.62 a share on January 24, according to data from Yahoo Finance. The following Monday, January 27, the stock dropped rapidly and closed at $118.52 a share. This event wiped $600 billion off of Nvidia’s market cap in just three days.
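
The percentages in this section can be checked against the quoted share prices. Below is a quick Python calculation that uses only the figures cited in this article; it compares an opening price with two closing prices, so treat the recovery figure as a rough gauge rather than a precise measure.

```python
# Share prices cited in this article, in USD: Yahoo Finance closing prices
# for Jan. 24 and Jan. 27, plus the Friday opening price mentioned below
close_jan_24 = 142.62
close_jan_27 = 118.52
open_friday = 140.00

drop = close_jan_24 - close_jan_27               # per-share loss on Jan. 27
drop_pct = drop / close_jan_24 * 100             # percentage decline
regained = (open_friday - close_jan_27) / drop   # fraction of the loss recovered

print(f"Drop: ${drop:.2f} per share ({drop_pct:.1f}%)")
print(f"Share of the drop regained by Friday's open: {regained:.0%}")
```

This prints a drop of $24.10 per share (16.9%), matching the figure above, and shows Friday’s opening price had recovered about 89% of that per-share loss.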

The chipmaker’s stock has since bounced back. On Friday it opened at $140 a share, meaning the company has almost fully regained that lost value in about a month. Nvidia reports its Q4 earnings on February 26, which will likely shed more light on the market’s reaction.

Meanwhile, DeepSeek announced on Thursday that it plans to open source five code repositories as part of an “open source week” event next week.

Rivian inches closer to profitability but warns ‘changes to government policies’ could hurt

Rivian’s cost-cutting measures have brought it a lot closer to profitability, but the company is warning that 2025 could still be a challenging year, especially because of the whirl of uncertainty caused by the new Trump administration.

The company announced Thursday its fourth-quarter and full-year 2024 financial results, and along with it, shared plans to deliver between 46,000 and 51,000 EVs across 2025. Rivian cautioned that “changes to government policies and regulations, and a challenging demand environment” could affect those results, according to the shareholder letter the EV maker released alongside its results.

Rivian didn’t specify what those changes might be, but Trump said on the campaign trail that he was inclined to find a way to kill the $7,500 federal EV tax credit. Vivek Ramaswamy, an ally of the Trump administration, has also called for clawing back Rivian’s $6.6 billion Department of Energy loan to build a plant in Georgia. That loan was finalized three days before Trump took office.

“We’re really looking forward to working with the new administration and Department of Energy on our loan, and we share in the President’s desire to bring jobs back to the US,” Rivian’s chief financial officer Claire McDonough said on a conference call Thursday, noting that the company plans to create 7,500 manufacturing jobs at the planned Georgia plant. She said later in the call that Rivian is bracing for a hit of as much as “hundreds of millions” of dollars related to tariffs, any loss of EV credits, and other policy changes.

“We really believe, and we’re very aligned with the administration on this, that the U.S. needs to continue to be a world leader in this regard, and our investment into electronics, into software, into autonomy and AI — these are really key areas for us as a country to continue to exercise a leadership position in,” CEO RJ Scaringe said on the call.

Rivian’s cost-cutting tear

Rivian spent much of 2024 on a cost-cutting tear. It laid off 10% of its workforce in February, and rolled out simplified, cheaper-to-make versions of its flagship EVs — the R1T pickup and the R1S SUV — in June. The company ended up changing 600 parts on those vehicles to drive down manufacturing costs, while also revamping its electric architecture and software user interface.

Changes like those helped Rivian notch $170 million of positive gross profit in the final quarter of 2024 – though $60 million of that came from software and services.

Rivian reported $1.7 billion in revenue for the fourth quarter, a 32% increase from the same period in 2023. The bulk of its Q4 revenue, about $1.5 billion, came from its automotive business: the sale of 14,183 vehicles, plus $299 million from selling zero-emissions regulatory credits to other automakers. For the year, Rivian reported $325 million in revenue from the sale of regulatory credits.

Software is playing an increasingly important role. Rivian generated $214 million from software and services in the fourth quarter, double the year-ago figure, and $484 million from software and services for all of 2024.

Rivian may be in the business of building and selling EVs, but its future is also largely pinned to software, namely through a lucrative joint venture with Volkswagen Group.

Revenue from software was primarily driven by charging and subscription fees, repair and maintenance services, and new vehicle electrical architecture and software development services provided by the joint venture, according to Rivian.

Gen AI comes to Rivian

The company has turned to generative AI as one tool to streamline customer service and reduce costs. The idea is to use AI to automate processes and “greatly reduce administrative overhead on all non-repair tasks,” the company said in its shareholder letter.

What that looks like in practice is an AI assistant, or chatbot, integrated into the Rivian app. The company rolled out a beta version in the Rivian mobile app for R1 customers this past December.

The AI assistant was built using a combination of in-house AI agent infrastructure and third-party large language models, according to a Rivian spokesperson, who added that the company has guardrails in place to limit conversations to Rivian service and guide-related questions.

The AI assistant was designed to answer questions about service needs and about the vehicle in general. A company spokesperson said it can also do basic troubleshooting and collect the information needed for service.

This story has been updated with information from Rivian’s quarterly earnings call.