AI + Tech Dashboard

Research first. Strong opinions second. Live data throughout.

Latest arXiv papers, your opinionated takes, and benchmark context in one page built for trust and repeat visits.

arXiv · cs.CV · June 18, 2026

JanusMesh: Fast and Zero-Shot 3D Visual Illusion Generation via Cross-Space Denoising

Creating 3D visual illusions, a single 3D mesh that reveals entirely different semantics from various viewing angles, is a fascinating but tough challenge. Existing optimization-based methods are slow and can produce oversaturated colors. In contrast, naive stitching approaches fail to produce geometrically coherent objects. This results in visible unnatural seams and semantic leaks. In this paper, we present a fast and training-free framework for generating text-driven 3D visual illusions. Our approach decouples the generation into two stages. First, we propose a cross-space dual-branch denoising process. This process dynamically decodes 3D latents into voxel space for CLIP-guided orientation alignment and Signed Distance Field (SDF) blending, which ensures seamless geometric fusion. Second, we introduce a view-conditioned texture synthesis module that projects and aggregates view-specific 2D diffusion priors onto the fused geometry. Extensive experiments demonstrate that our method generates highly realistic, dual-semantic 3D illusions in just 3-5 minutes. It significantly outperforms existing methods in geometric integrity, semantic recognizability, and efficiency. Project page: https://siang1105.github.io/JanusMesh.github.io/

Siang-Ling Zhang, Huai-Hsun Cheng, Tsung-Ju Yang +1 more
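
As a rough illustration of what blending signed distance fields in voxel space can look like, here is a minimal NumPy sketch. It is not the JanusMesh pipeline: the sphere shapes, the 32-voxel grid, and the helper names `sphere_sdf`, `blend_sdf`, and `union_sdf` are all assumptions made for this example, and the abstract does not say which blending rule the authors use.

```python
import numpy as np

def sphere_sdf(grid, center, radius):
    """Signed distance to a sphere: negative inside, positive outside."""
    return np.linalg.norm(grid - np.asarray(center), axis=-1) - radius

def union_sdf(sdf_a, sdf_b):
    """Exact union of two shapes: the closer surface wins at every voxel."""
    return np.minimum(sdf_a, sdf_b)

def blend_sdf(sdf_a, sdf_b, weight):
    """Linear blend of two SDF grids (weight=0 gives sdf_a, weight=1 gives sdf_b).

    The result is a smooth interpolation, not a true distance field.
    """
    return (1.0 - weight) * sdf_a + weight * sdf_b

# Hypothetical 32^3 voxel grid covering [-1, 1]^3.
axis = np.linspace(-1.0, 1.0, 32)
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)

# Two stand-in shapes that overlap near the origin.
sdf_a = sphere_sdf(grid, (-0.3, 0.0, 0.0), 0.5)
sdf_b = sphere_sdf(grid, (0.3, 0.0, 0.0), 0.5)

fused = union_sdf(sdf_a, sdf_b)
occupied = fused <= 0.0  # voxels inside the fused shape
```

Because both fields live on the same grid, fusing them is a per-voxel operation, which is one reason voxel-space blending avoids the seams that mesh stitching can leave.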

arXiv · cs.RO · June 18, 2026

MemoryWAM: Efficient World Action Modeling with Persistent Memory

Robust robotic manipulation in the real world requires not only an understanding of the current observation, but also memory and dynamics modeling. World action models (WAMs) possess these capabilities by jointly modeling visual foresight and actions conditioned on both current and historical observations, making them a promising paradigm for robotic manipulation. However, existing WAMs face a fundamental trade-off: methods with efficient inference typically condition only on a bounded window of recent observations and therefore struggle in non-Markovian environments, whereas methods that preserve long histories incur time and space costs that grow substantially with sequence length. To address this challenge, we introduce MemoryWAM, a world action model with efficient persistent memory. MemoryWAM uses a hybrid memory design that combines recent frames, event-boundary anchor frames, and compact gist tokens that summarize long-range history. A tailored attention mechanism enables retrieval of both detailed short-term context and compressed long-term context, supporting memory-dependent decision-making with reduced inference latency and GPU memory usage. Across long-horizon, memory-dependent manipulation tasks in both simulation and the real world, MemoryWAM outperforms strong vision-language-action (VLA) and WAM baselines while maintaining favorable computational efficiency.
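
The three-part memory described in the abstract (recent frames, event-boundary anchors, and a compact long-range summary) can be pictured with a small data structure. The sketch below is only an assumption-laden stand-in: the class name `HybridMemory`, the `is_event_boundary` flag, and the running-mean "gist" are invented for illustration, whereas the paper uses learned gist tokens and a tailored attention mechanism.

```python
from collections import deque

class HybridMemory:
    """Toy memory: bounded recent window, event anchors, and a running-mean gist."""

    def __init__(self, recent_size=4):
        self.recent = deque(maxlen=recent_size)  # oldest frames fall off automatically
        self.anchors = []                        # frames kept at event boundaries
        self.gist = None                         # running mean over all frames seen
        self.count = 0

    def add(self, frame, is_event_boundary=False):
        """frame is a list of floats (a hypothetical per-frame feature vector)."""
        self.recent.append(frame)
        if is_event_boundary:
            self.anchors.append(frame)
        self.count += 1
        if self.gist is None:
            self.gist = list(frame)
        else:
            # Incremental mean update: constant memory regardless of history length.
            self.gist = [g + (x - g) / self.count for g, x in zip(self.gist, frame)]

    def context(self):
        """Everything a policy would condition on at this step."""
        return {
            "recent": list(self.recent),
            "anchors": list(self.anchors),
            "gist": self.gist,
        }

memory = HybridMemory(recent_size=4)
for i in range(10):
    memory.add([float(i)], is_event_boundary=(i % 5 == 0))
ctx = memory.context()
```

In this toy version the recent window and the gist stay fixed in size, and only the anchor list grows with the number of detected events rather than with the number of frames.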

arXiv · cs.CV · June 18, 2026

TimeProVe: Propose, then Verify for Efficient Long Video Temporal Reasoning in Activities of Daily Living

Long Video Question Answering (LVQA) requires identifying sparse, query-relevant evidence within hours-long untrimmed videos. Existing approaches either process videos densely with large vision-language models (VLMs), incurring prohibitive computational cost, or rely on sparse caption-based reasoning, which often misses temporally localized and motion-centric evidence. We introduce TimeProVe, a cost-efficient hybrid framework for temporally grounded reasoning in long videos. TimeProVe first employs lightweight modules to generate action-grounded answer-evidence hypotheses and subsequently invokes an expensive VLM only for targeted verification. The core of our framework lies in the Action-based Candidate Evidence (ACE) module, which converts temporally localized actions into query-conditioned candidate answers and supporting evidence windows through lightweight LLM reasoning. We further introduce OpenTSUBench (OTB), an open-ended benchmark designed to evaluate temporally grounded reasoning in real-world Activities of Daily Living (ADL) scenarios. Experiments show that TimeProVe outperforms the strongest baseline on OTB by 7.3%, while reducing VLM calls by 75% and inference cost by 93%. Furthermore, without explicit temporal grounding training, TimeProVe achieves competitive performance on Charades-STA, and reaches state-of-the-art results when enhanced with grounding VLMs.
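
The propose-then-verify pattern can be sketched in a few lines: a cheap stage ranks candidate answers with evidence windows, and the expensive verifier is called only on the top few. Everything named here (`propose_then_verify`, the `Candidate` tuple layout, the `budget` parameter, and the stub verifier) is hypothetical and stands in for the ACE module and the VLM described in the abstract.

```python
from typing import Callable, List, Optional, Tuple

# (answer, (start_s, end_s) evidence window, cheap proposer score)
Candidate = Tuple[str, Tuple[float, float], float]

def propose_then_verify(
    candidates: List[Candidate],
    verify: Callable[[str, Tuple[float, float]], bool],
    budget: int,
) -> Tuple[Optional[str], int]:
    """Return the first verified answer and the number of expensive calls made."""
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
    calls = 0
    for answer, window, _score in ranked[:budget]:
        calls += 1  # each verify() call stands in for one costly VLM invocation
        if verify(answer, window):
            return answer, calls
    return None, calls

# Stub verifier for illustration: accepts only windows overlapping 120-150 s.
def stub_verify(answer: str, window: Tuple[float, float]) -> bool:
    start, end = window
    return start < 150.0 and end > 120.0

candidates = [
    ("making tea", (10.0, 40.0), 0.9),
    ("washing dishes", (125.0, 140.0), 0.7),
    ("reading", (300.0, 330.0), 0.2),
]
answer, calls = propose_then_verify(candidates, stub_verify, budget=2)
```

With this toy input the verifier runs twice rather than once per candidate, which is the kind of saving the approach is after.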

arXiv · cs.LG · June 18, 2026

How Transparent is DiffusionGemma?

LLM reasoning transparency is a critical affordance for understanding model decisions, mitigating misuse and misalignment, and debugging surprising model behaviors. However, DiffusionGemma performs a larger fraction of its computation in a continuous latent space; does this make its reasoning less transparent? We study this question by decomposing transparency into two components: variable transparency, whether we understand intermediate snapshots of a model's computational state; and algorithmic transparency, whether we can use these snapshots to reconstruct the process by which the model arrived at its outputs. Naively, DiffusionGemma has poor variable transparency: its opaque serial depth, the amount of serial computation that occurs in between interpretable model states, seems at first 28.6X higher than the corresponding autoregressive Gemma 4 model. However, we show that we can map the information flowing between denoising steps through an interpretable token bottleneck with no decrease in downstream performance. Treating these intermediate states as interpretable reduces the opaque serial depth to just 1.1X that of Gemma 4. Algorithmic transparency is harder for diffusion models than for autoregressive models because all token predictions in the canvas can change at every denoising step, giving the model the power to implement complicated distributed algorithms during the denoising process. To begin bridging this gap, we conduct a suite of interpretability case studies, uncovering initial evidence of novel diffusion-specific phenomena such as non-chronological reasoning, token and sequence smearing, and intermediate-context reasoning. Finally, we test monitorability, a key application of transparency that measures whether model outputs are useful for downstream tasks. We find that DiffusionGemma is similarly monitorable to Gemma 4.

arXiv · cs.CV · June 18, 2026

UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning

Egocentric video understanding is inherently limited by the narrow perspective of wearable cameras: a single viewpoint, a single modality, a single model cannot capture the full richness of human action. We argue that a truly expressive egocentric representation must subsume complementary knowledge across viewpoints, modalities, and foundation model representations, yet remain deployable from egocentric video alone. To this end, we introduce a hierarchical multi-teacher distillation framework that produces UNIEGO, a unified egocentric encoder trained with nine teachers spanning ego-exo viewpoints, RGB, depth, and skeleton modalities, and four foundation models. Rather than distilling directly from heterogeneous teachers whose incompatible architectures and feature geometries induce conflicting gradients, our framework interposes a layer of representation-specific Proxy models that translate diverse teacher knowledge into a homogeneous egocentric space. A second distillation stage, Selective Proxy Distillation (SPD), then adaptively selects, for each training sample, the subset of proxies that are both correct and confident, distilling exclusively from reliable supervision and suppressing erroneous signals. SPD is further stabilized by initializing UNIEGO as a learned convex combination of proxy parameters, placing the unified model in a well-conditioned region of the loss landscape before distillation begins. UNIEGO achieves state-of-the-art performance across three egocentric video understanding tasks (action recognition, video retrieval, and action segmentation) on three challenging ego-exo benchmarks, outperforming naive multi-teacher distillation baselines and demonstrating that structured, proxy-mediated knowledge transfer yields richer and more discriminative egocentric representations.
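
The per-sample selection step ("correct and confident" proxies) might look roughly like the following. This is a sketch under stated assumptions, not the authors' SPD: the confidence threshold `min_conf`, the averaging of selected proxy distributions, and the one-hot fallback when no proxy qualifies are all choices made here for illustration.

```python
import numpy as np

def select_proxies(proxy_probs, label, min_conf=0.6):
    """Boolean mask over proxies that predict `label` with confidence >= min_conf.

    proxy_probs has shape (num_proxies, num_classes) for a single sample.
    """
    preds = proxy_probs.argmax(axis=1)
    conf = proxy_probs.max(axis=1)
    return (preds == label) & (conf >= min_conf)

def distill_target(proxy_probs, label, min_conf=0.6):
    """Soft target for the student: mean of the selected proxies' distributions."""
    mask = select_proxies(proxy_probs, label, min_conf)
    if not mask.any():
        # Assumed fallback: with no reliable proxy, use the ground-truth label.
        target = np.zeros(proxy_probs.shape[1])
        target[label] = 1.0
        return target
    return proxy_probs[mask].mean(axis=0)

# Three hypothetical proxies, three classes, true label 0.
probs = np.array([
    [0.8, 0.1, 0.1],  # correct and confident: selected
    [0.5, 0.4, 0.1],  # correct but below the threshold: dropped
    [0.2, 0.7, 0.1],  # confident but wrong: dropped
])
mask = select_proxies(probs, label=0)
target = distill_target(probs, label=0)
```

Only the first proxy survives the filter here, so the student would be trained toward its distribution alone.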

arXiv · cs.LG · June 18, 2026

Optimal Deterministic Multicalibration and Omniprediction

A model is multicalibrated on a collection of group weights $G$ if it is calibrated -- i.e. unbiased even conditional on its prediction -- not just overall, but also after reweighting contexts by each $g \in G$. It is a useful property for many downstream applications and is a basic desideratum of trustworthy machine learning. Before this work, all predictors known to attain the minimax-optimal $\widetilde O(\varepsilon^{-3})$ sample complexity rate for $\varepsilon$-multicalibration were randomized, while deterministic predictors were known only with substantially worse sample complexity. Whether randomization is necessary for optimal sample complexity in multicalibration was explicitly asked by [CLNR26] and implicitly in several prior works. We resolve this open problem by giving a minimax-optimal multicalibration algorithm that outputs a deterministic predictor. We then generalize the algorithm to produce optimal deterministic predictors that satisfy outcome indistinguishability (OI) with respect to finite or finitely covered collections of tests. As an application, this also gives deterministic omnipredictors and panpredictors with optimal sample complexity, resolving open problems posed by [OKK25] and [BHHLZ25].
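
For readers who want the definition in symbols, one common way to formalize the property is shown below. It is a sketch only: the paper's exact definition (choice of norm, normalization, and how the conditioning on predictions is handled) may differ, and the symbols $p$, $x$, $y$, and $\mathrm{MCerr}_G$ are notation assumed here rather than taken from the abstract.

```latex
% p : X -> [0,1] is the predictor, y in {0,1} the outcome,
% and each g in G is a group weight function g : X -> [0,1].
\[
  \mathrm{MCerr}_G(p)
  \;=\;
  \max_{g \in G}\;
  \mathbb{E}\Big[\,\Big|\,
    \mathbb{E}\big[\, g(x)\,\big(y - p(x)\big) \;\big|\; p(x) \,\big]
  \Big|\,\Big]
  \;\le\; \varepsilon .
\]
```

Taking $G$ to contain only the constant function $g \equiv 1$ recovers ordinary calibration under the same norm, which is the sense in which multicalibration strengthens it.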

arXiv · cs.CV · June 18, 2026

Thinking in Boxes: 3D Editing in Real Images Made Easy

Text and 2D-conditioning interfaces provide weak, ambiguous control over spatial transformations in image editing -- particularly under large object motions and camera changes. Prior work has used 3D primitives such as boxes, but only as loose conditioning signals indicating approximate object location rather than specifying the transformation. We instead use 3D boxes as structured specifications: the user provides the input and output boxes of the edit, casting editing as a well-posed geometry problem. This "thinking in boxes" interface, where each box face is color-coded to convey 3D orientation, gives precise control over translation, rotation, scaling, and viewpoint changes in real images while preserving scene and object identity, and recovering previously unseen object regions. To ground transformations in scene appearance, we introduce a depth-aligned planar floor as a global reference frame, shaded with depth-aware cues. Conditioned on this structure, an image generator produces consistent results under large transformations. Trained in two stages -- on synthetic multi-object scenes and a small set of real-world videos from Objectron -- the system generalizes to complex, in-the-wild real images. Our method operates directly on real photographs and substantially outperforms recent state-of-the-art methods on large 3D edits.

Research source: arXiv API.

Opinion Desk

Your latest takes, front and center.

A spacious editorial layout that makes your voice the main event, with direct links to each full post.

Browse all posts · Follow via RSS

February 18, 2026 · 8 min read

What's the best AI model? It depends.

A practical framework for choosing AI models by workload, with live benchmark context for writing, coding, and agentic execution.

AI Models · Benchmarks · LLMs · Tech Strategy

February 17, 2026 · 7 min read

OpenClaw + OpenAI: massive strategic win or expensive integration failure?

If OpenAI acquired OpenClaw, the upside could be distribution and product speed. The downside could be product overlap, antitrust pressure, and execution drag.

M&A · AI Platforms · OpenAI

February 16, 2026 · 7 min read

The AI bubble might cool hard before the real winners emerge

The hype cycle is peaking in some segments, but infrastructure and enterprise adoption suggest a rotation, not total collapse.

AI Market · Venture · Hype Cycle

February 15, 2026 · 8 min read

Gold, China, and currency influence: what matters and what is overstated

China's gold strategy matters for reserves, signaling, and pricing influence, but a full gold-standard return is still unlikely in the near term.

Macro · Gold · China

February 14, 2026 · 7 min read

Why RAM prices are still high: AI data centers, supply discipline, and pushback

Memory demand from AI infrastructure is colliding with concentrated supply and disciplined production, keeping prices elevated.

Hardware · Memory · Data Centers

February 13, 2026 · 9 min read

Why tech costs more now, even when manufacturing keeps improving

Better production techniques reduce unit costs, but premium positioning, bundled software value, and market anchoring keep end-user prices high.

Pricing · Consumer Tech · Apple

February 12, 2026 · 7 min read

GTA 6 delay risk: when massive hype can turn into a launch liability

Long sequel gaps can increase expectations faster than any studio can satisfy. GTA 6 could still dominate, but over-hype raises failure risk.

Gaming · GTA 6 · Hype Cycles

Live LLM Leaderboard Pulse

Snapshot of current top-performing models by intelligence and coding benchmarks.

Rank	Model	Creator	Intelligence Index	Coding Index
1	Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)	Anthropic	59.9	76.5
2	Claude Opus 4.8 (Adaptive Reasoning, Max Effort)	Anthropic	55.7	74.3
3	GPT-5.5 (xhigh)	OpenAI	54.8	74.9
4	Claude Opus 4.7 (Adaptive Reasoning, Max Effort)	Anthropic	53.5	73.6
5	GPT-5.5 (high)	OpenAI	53.1	71.6

Rank	Model	Creator	Coding Index	Intelligence Index
1	Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)	Anthropic	76.5	59.9
2	GPT-5.5 (xhigh)	OpenAI	74.9	54.8
3	Claude Opus 4.8 (Adaptive Reasoning, Max Effort)	Anthropic	74.3	55.7
4	Claude Opus 4.7 (Adaptive Reasoning, Max Effort)	Anthropic	73.6	53.5
5	GPT-5.5 (high)	OpenAI	71.6	53.1
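
The two tables above contain the same five rows sorted by different columns. A minimal sketch of that re-ranking, using the values shown on this page, could look like this; the helper name `rank_by` and the tuple layout are assumptions for the example, not part of the dashboard's actual code.

```python
# (model, creator, intelligence_index, coding_index), copied from the tables above.
ROWS = [
    ("Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)", "Anthropic", 59.9, 76.5),
    ("Claude Opus 4.8 (Adaptive Reasoning, Max Effort)", "Anthropic", 55.7, 74.3),
    ("GPT-5.5 (xhigh)", "OpenAI", 54.8, 74.9),
    ("Claude Opus 4.7 (Adaptive Reasoning, Max Effort)", "Anthropic", 53.5, 73.6),
    ("GPT-5.5 (high)", "OpenAI", 53.1, 71.6),
]

INTELLIGENCE, CODING = 2, 3  # column positions inside each row tuple

def rank_by(rows, column):
    """Return (rank, model, score) triples sorted by the given column, best first."""
    ordered = sorted(rows, key=lambda row: row[column], reverse=True)
    return [(rank, row[0], row[column]) for rank, row in enumerate(ordered, start=1)]

by_intelligence = rank_by(ROWS, INTELLIGENCE)
by_coding = rank_by(ROWS, CODING)

for rank, model, score in by_coding:
    print(f"{rank}\t{model}\t{score}")
```

Only ranks 2 and 3 swap between the two orderings: GPT-5.5 (xhigh) edges ahead of Claude Opus 4.8 on the Coding Index while trailing it on the Intelligence Index.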

Latest From The Web

Live multi-source stream from Hacker News, Reddit, and DEV Community.

Auto-refresh cadence: every 15-20 minutes via server-side fetch.
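
A 15-20 minute cadence is often implemented as a randomized delay between fetches so that requests do not land on a fixed schedule. The sketch below shows one way to pick such a delay; the function name `next_refresh_delay_s` and the uniform jitter are assumptions, since the page does not describe how its refresh is actually scheduled.

```python
import random

def next_refresh_delay_s(min_minutes=15, max_minutes=20, rng=random):
    """Pick a delay in seconds, uniformly between the two bounds."""
    return rng.uniform(min_minutes * 60, max_minutes * 60)

delay = next_refresh_delay_s()
print(f"next fetch in {delay / 60:.1f} minutes")
```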

DEV Community · June 18, 2026 · by Aria Heller

I'm not a developer, but I built a calendar app to fix my most annoying work task

DEV Community · June 18, 2026 · by Daniel Balcarek

Tower Before Dusk: I Built a Puzzle Game for Humans and AI

DEV Community · June 16, 2026 · by Sergei Frangulov

AI took the friction out of my work. Then I found out the friction was holding up two things: my ideas and my brakes. Twenty-five years in a confession.

DEV Community · June 15, 2026 · by Julien Avezou

Building a Chrome Extension to Make AI Use More Intentional

Turning Gemma 4 into an Old Korean Translator

Launch HN: Tensil (YC S19) – Open-Source ML Accelerators