GPT-5.5's '9.7T Parameters' Re-evaluated: Revised to Approximately 1.5T
According to monitoring by Beating, AI researchers Lawrence Chan and Benno Sturgeon have published a review of the paper by Pine AI Chief Scientist Li Bojie, "Incompressible Knowledge Probes: Estimating the Parameter Count of Black Box Large Language Models Based on Fact Capacity." The original paper used 1,400 trivia questions to "weigh" closed-source models, estimating GPT-5.5 at about 9.7T parameters, Claude Opus 4.7 at around 4.0T, and o1 at approximately 3.5T.

The reviewers believe the approach itself is valuable, but that the original figures were significantly inflated by the scoring criteria and question quality. The main issue is the "floor score." The original paper divided the questions into seven difficulty levels; when a model answered too many questions incorrectly at a given level, its score for that level could in theory go negative, but the code clipped each level's minimum score to 0. This inflated the apparent advantage of cutting-edge models on difficult questions and, in turn, increased the parameter counts inferred for them (a toy numerical illustration appears at the end of this article). The paper claims it did not apply this treatment, yet the code and published results did.

After removing the floor score, the fitted slope decreased from 6.79 to 3.56. The slope can be read as how much parameter growth each additional point of score translates into; a smaller slope means the same score gap no longer corresponds to such an exaggerated parameter gap. The R² value dropped from 0.917 to 0.815, indicating that the score-to-parameter fit is less stable than the original paper suggested, and the 90% prediction interval widened from a factor of 3.0 to a factor of 5.7, meaning the margin of error is wide and single-point figures should not be taken at face value.

The review also found that 131 of the 1,400 questions (9.4%) were ambiguous or had incorrect answers. These problems were concentrated among the difficult questions, which are precisely the ones used to differentiate cutting-edge closed-source models such as GPT-5.5 and Claude Opus 4.7.

Under the reviewers' revised criteria, GPT-5.5 drops from the original paper's 9659B to 1458B, with a 90% prediction interval of 256B to 8311B; Claude Opus 4.7 drops from 4042B to 1132B; and GPT-5 drops from 4088B to 1330B. The reviewers also emphasized that 1.5T should not be regarded as GPT-5.5's true parameter count. The more accurate conclusion is that this "trivia weighing" method is highly sensitive to scoring details and question quality, and that figures like 9.7T cannot be used directly as a measure of closed-source model size.
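The following is a minimal sketch of the "trivia weighing" pipeline as the article describes it, not the paper's or the reviewers' actual code: score models of known size across seven difficulty levels, fit log parameter count against the total score, and extrapolate a closed model from its score. Every model size and per-level score below is invented for illustration.

```python
# Toy sketch, assuming a simplified version of the method described above.
# All numbers are invented; this is not the paper's or the reviewers' code.
import math

# (known parameter count in billions, invented scores over 7 difficulty levels)
# Negative per-level scores arise when wrong answers outweigh right ones.
open_models = [
    (7,   [8, 5, 1, -3, -6, -9, -11]),
    (14,  [9, 6, 3, -1, -4, -7, -9]),
    (70,  [10, 8, 5, 2, -1, -4, -6]),
    (405, [10, 9, 7, 4, 1, -2, -4]),
]
frontier_scores = [10, 10, 9, 7, 5, 2, -1]  # hypothetical closed model

def total(levels, floor):
    """Sum per-level scores; the 'floor score' clips each level at 0."""
    return sum(max(s, 0) if floor else s for s in levels)

def fit_and_extrapolate(floor):
    """Fit log(params) ~ score on the known models, extrapolate the closed one."""
    xs = [total(lv, floor) for _, lv in open_models]
    ys = [math.log(p) for p, _ in open_models]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))  # least-squares slope
    intercept = my - slope * mx
    est = math.exp(slope * total(frontier_scores, floor) + intercept)
    return slope, est

for floor in (True, False):
    slope, est = fit_and_extrapolate(floor)
    label = "floored" if floor else "raw"
    print(f"{label:7s} slope={slope:.3f}  extrapolated size ~ {est:,.0f}B")
# Flooring compresses the weaker models' scores upward, steepening the fitted
# slope and, in this toy example, roughly quadrupling the extrapolated size.
```

The direction matches the reviewers' finding: removing the floor lowered the reported slope from 6.79 to 3.56 and pulled the extrapolated GPT-5.5 estimate down accordingly.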
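The reported interval widths can also be sanity-checked as multiplicative factors. Reading the "5.7 times" figure as a geometric half-width (an interpretation on our part, not stated in the review) gives an interval of roughly est/5.7 to est*5.7, which closely reproduces the 256B to 8311B range quoted for the revised 1458B GPT-5.5 estimate:

```python
import math

# Revised GPT-5.5 figures quoted in the review.
est, lo, hi = 1458, 256, 8311
factor = math.sqrt(hi / lo)  # geometric half-width of the interval
print(f"factor ~ {factor:.2f}x")                      # ~5.70x
print(f"{est / factor:.0f}B to {est * factor:.0f}B")  # ~256B to ~8307B,
                                                      # close to the reported 8311B
```

At a 5.7x factor, the revised point estimate is consistent with anything from a few hundred billion to several trillion parameters, which is exactly the reviewers' caution against over-reading any single figure.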