Job No.

IRC299413

Published on 1 July 2026

AI Evaluation & Benchmarking Engineer IRC299413

Job Category

Software Product Engineering

Experience

5-10 years

Location

United States - Minneapolis MN

Skills

LLM, Python

Work Model

On-site

Job Description

We are looking for an AI Evaluation & Benchmarking Engineer with experience in reinforcement learning, LLM-based agents, experiment design, benchmarking, and performance evaluation. This role will support the productionization of an AI evaluation platform used to execute and evaluate algorithms within video game environments.

The engineer will develop and integrate baseline algorithms, reinforcement learning approaches, LLM-based agents, and externally developed algorithms into the platform. This person will also design experiments, define evaluation metrics, run benchmarks, analyze performance, and serve as a primary power user of the platform to provide feedback to the engineering team.

Ideal Candidate Profile

The ideal candidate is a hands-on AI evaluation engineer who can both build and use the platform. This person should be comfortable integrating algorithms, running experiments, defining metrics, analyzing results, and giving practical feedback to engineering teams. The role requires a blend of ML experimentation, LLM agent evaluation, Python engineering, and strong platform-user instincts.

Important Note

GlobalLogic estimates that the starting pay range for this role, to be performed in Minneapolis, MN, is $150K to $180K. This range reflects base salary only and does not include additional performance-linked variable compensation, benefits, etc. that may be applicable for the role. The pay range is provided as a good faith estimate, and the amount offered may be higher or lower. GlobalLogic takes many factors into consideration in making an offer, including candidate qualifications, work experience, operational needs, travel and onsite requirements, internal peer equity, prevailing wage, responsibilities, and other market and business considerations.

Requirements

* Hands-on reinforcement learning experience.
* Experience using LLMs for agents, evaluation, reasoning, automation, or benchmark workflows.
* Strong Python experience for ML, data workflows, experimentation, and analysis.
* Experience designing and running experiments with statistical and analytical rigor.
* Strong understanding of evaluation metrics, scoring frameworks, performance comparison, and benchmark design.
* Experience analyzing structured logs, run outputs, model/agent performance, and experiment results.
* Ability to work across APIs, logs, CLI/tools, data structures, and platform workflows.
* Strong communication skills to translate experiment findings into platform improvement requirements.
* Ability to work inside client-owned repositories, infrastructure, workflows, and security controls.

Preferred Skills

* Experience with game environments, simulation environments, Gym-like interfaces, RL environments, or agentic AI test harnesses.
* Experience benchmarking LLM agents, RL policies, autonomous agents, or hybrid AI systems.
* Experience with experiment tracking, run comparison tools, metrics dashboards, or evaluation pipelines.
* Experience with prompt engineering, agent orchestration, tool use, and LLM evaluation frameworks.
* Experience with data visualization and performance analytics.
* Experience working with externally developed algorithms, reproducible experiments, and version-controlled evaluation workflows.

Responsibilities

* Develop, adapt, and integrate reinforcement learning algorithms and baseline approaches into the shared evaluation platform.
* Integrate LLM-based agents and/or evaluators for solving, interacting with, and benchmarking game environments.
* Integrate external or off-the-shelf algorithms into the platform using defined execution and ingestion workflows.
* Design and run benchmark experiments across games, environments, configurations, agents, and algorithm versions.
* Define evaluation strategies for comparing RL, LLM-based, hybrid, and baseline approaches.
* Define, extract, and validate meaningful performance metrics from logs, outputs, run results, and environment interactions.
* Build comparison logic, scoring approaches, rankings, verdicts, and performance summaries.
* Develop analytics and visualizations to evaluate algorithm performance across runs and environments.
* Act as a primary power user of the platform, running experiments and identifying gaps in tooling, APIs, metrics, workflows, logs, and user experience.
* Provide structured feedback to Platform and Full Stack engineers to improve execution, logging, evaluation, and reporting capabilities.
* Validate existing game environments and support development or validation of new game environments.
* Evaluate environment operability using baseline/reference frontier LLMs, harnesses, and agents.
* Collaborate with client technical teams and engineering resources within 3M-owned repositories, workflows, infrastructure, and security processes.
* Ensure all algorithms, experiments, notebooks/scripts, configuration, documentation, and outputs comply with 3M-defined standards and policies.

What We Offer

Exciting Projects: Come take your place at the forefront of digital transformation! With clients across all industries and sectors, we offer an opportunity to work on market-defining products using the latest technologies.

Collaborative Environment: You can expand your skills by collaborating with a diverse team of highly talented people in an open, laid-back environment, or even abroad in one of our global centers or client facilities!

Work-Life Balance: GlobalLogic prioritizes work-life balance, which is why we offer flexible work schedules and opportunities to work from home.

Professional Development: We provide continuing education classes, professional certification, and training (technical, soft skills, language, and communication skills) to help you realize your professional goals. As part of a global organization, we also offer additional learning opportunities through international knowledge exchanges.

Excellent Benefits: We provide our employees with competitive salaries, health and life insurance, short-term and long-term disability insurance, a matched-contribution 401(k) plan, flexible spending accounts, and PTO and holidays.

About GlobalLogic

GlobalLogic, a Hitachi Group Company, is a trusted digital engineering partner to the world’s largest and most forward-thinking companies. Since 2000, we’ve been at the forefront of the digital revolution – helping create some of the most innovative and widely used digital products and experiences. Today we continue to collaborate with clients in transforming businesses and redefining industries through intelligent products, platforms, and services.