I’m an Engineer / Architect / Builder. For more details about me, see About.
This is a reboot of writing my notes and thoughts online after a pause of several years.
Latest Notes
-
The Post-Training Dilemma: Safety Alignment vs Benchmark Score
Anthropic has long positioned itself as putting AI safety at the frontier. But its models showed many concerning behaviours in Andon Labs's Vending Bench and Arena, including lying, coordinating a price cartel, and exploiting others' desperate situations. It's challenging to balance safety alignment, instruction-following, and model capability, so how are the labs doing on this? If a bit less safety alignment can lead to higher benchmark scores, labs have an incentive to reduce safety alignment. What will they choose to do?
-
Frontier Model Benchmarks Need Maintenance, Audits, and Rolling Updates
HLE shows why frontier model scores need to be read alongside audits, verified answer sets, and maintenance, especially when questions sit at the edge of expert knowledge.
-
ARC-AGI-3 shows there is still a huge gap between frontier models and humans on agentic intelligence
GPT-5.5 and Opus 4.7 scored below 1% on ARC-AGI-3, against a human baseline of 100%. The benchmark requires models to explore and learn in novel game-style environments.
-
Incomparable SWE-bench Pro Scores
Claude Opus 4.7 and GPT-5.5 SWE-bench Pro scores are useful directional signals, but their reported margins are not directly comparable.
Recent Posts
-
Beyond Answers, Below Autonomy: How Proactive AI Agents Offload Humans Without Overstepping
Proactive agents can offload humans from the glue work between insight and execution: they watch your systems, gather context, and turn signals into decision-ready options and actions. The key is staying below autonomy, because implicit context and accountability still sit with humans.
-
Using BERT to perform Topic Tag Prediction for Technical Articles
Updated: Experiments using BERT Mini embeddings and linear SVM for multilabel tag prediction on LinkedInfo articles.
-
A Walk Through of the IEEE-CIS Fraud Detection Challenge
Walkthrough of the IEEE-CIS fraud detection challenge with feature analysis and model experiments.
-
Skin Lesion Image Classification with Deep Convolutional Neural Networks
Updated: Deep CNN experiments (DenseNet/ResNet) on HAM10000 for skin lesion image classification.