
Performance

Date: 2025-07-19 15:03 | Author: admin | Views: 14
Given the risks associated with AI, we consider it essential to routinely evaluate and benchmark our models and share the results.

Industry-Leading Accuracy and Comprehensiveness

Scoring highest across multiple dimensions

System Pro's synthesis surpasses OpenAI's GPT-4 in terms of accuracy and relevance, while maintaining an up-to-date knowledge of scientific discoveries.

Methodology

We conducted a blind, randomized study with biomedical researchers and clinicians, recruiting participants via User Interviews between October 15 and 29, 2023. Each subject-matter expert was assigned a specific set of tasks aligned with their expertise and was asked to evaluate two randomly selected syntheses: one generated by System and the other by OpenAI's GPT-4.


For each assigned synthesis, participants rated various aspects on a scale of 1-10, with 1 indicating very poor and 10 indicating perfect. The Harmfulness rating scale was reversed.

Accuracy: Do the summaries contain factual errors, and do they provide accurate information on the topic?

Comprehensiveness: Do the summaries cover essential aspects of the topic or the question? Is there any key information missing from the summaries?

Relevance: Are the summaries relevant to what you expect to see for the topic?

Clarity: Are the summaries easy to understand and do they present clear information?

Harmfulness: Do you think the summaries are harmful for someone like you? Do you think trusting the information in the summary will do medical harm?


Before commencing data collection, we conducted a statistical power analysis to estimate the required amount of survey data. The reported results are based on 207 responses from 68 unique participants, achieving a statistical power of 0.86.
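The power analysis mentioned above can be illustrated with a minimal sketch. The source does not state which test, effect size, or significance level was used, so the two-sided, two-sample comparison of means, the normal approximation, and every numeric input below are assumptions chosen only for illustration.

```python
import math
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample comparison of means.

    Uses the normal approximation and ignores the negligible contribution
    of the opposite tail. effect_size is Cohen's d (an assumed input).
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    noncentrality = effect_size * math.sqrt(n_per_group / 2)
    return NormalDist().cdf(noncentrality - z_crit)


def required_n_per_group(effect_size, power=0.8, alpha=0.05):
    """Smallest per-group sample size reaching the target power (same approximation)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_crit + z_power) / effect_size) ** 2)


# Hypothetical inputs: a medium effect (d = 0.5) at alpha = 0.05.
print(required_n_per_group(0.5, power=0.8))
print(round(approx_power(0.5, 64), 3))
```

In practice an exact calculation based on the noncentral t distribution gives slightly larger sample sizes than this approximation.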

Researchers and clinicians prefer System Pro's synthesis

Taking accuracy, completeness, relevance, helpfulness, and clarity into account, roughly 70% of experts (68% in this study) prefer System Pro’s synthesis over other AI-assisted research tools.

System

68%

Commercial Product 1

32%

Methodology

We conducted a randomized single-blind study with researchers and clinicians. Users were recruited on User Interviews between October 1 and 15, 2023. Each subject-matter expert was assigned a set of tasks relevant to their domain of expertise. For each task, users were asked to compare two randomly assigned syntheses: one generated by System and the other by another commercial product. They were then instructed to choose the better synthesis, taking into account multiple dimensions (accuracy, completeness, clarity, relevance, and helpfulness). After each selection, users were required to provide a reason for their choice. Prior to data collection, a statistical power analysis was conducted to estimate the amount of survey data needed. The presented results are based on 144 responses from 33 unique participants.
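To give a sense of the uncertainty around a preference share like the one reported above, here is a minimal sketch of a Wilson score interval. It is not part of the study's reported analysis. The count of 98 is an assumption (68% of 144 is not a whole number), and the interval treats the 144 responses as independent, which is only approximate because they come from 33 participants.

```python
import math
from statistics import NormalDist


def wilson_interval(successes, total, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Assumed count: 98 of 144 responses preferring System (about 68%).
low, high = wilson_interval(98, 144)
print(f"{low:.3f} to {high:.3f}")
```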

The most accurate, comprehensive, and relevant research synthesis on the market

Methodology

Fifty search queries run by System Pro users between June and October 2023 were used to create a dataset of syntheses from System, Commercial Product #1, and Commercial Product #2.

Syntheses (preserving citations included in the respective UIs) were randomly assigned to biomedical researchers and clinicians, blindly recruited on the User Interviews platform, for evaluation. For each assigned synthesis, users were asked to score various aspects from 1 to 10, with 1 being very poor and 10 being perfect.

- Accuracy: Do the summaries contain factual errors, and do they provide accurate information on the topic?

- Comprehensiveness: Do the summaries cover essential aspects of the topic or the question? Is there any key information missing from the summaries?

- Relevance: Are the summaries relevant to what you expect to see for the topic?

A two-sample t-test was used to measure the difference between the scores of different products on the same metric. Once the test result became significant for all metrics, data collection was stopped.

The presented results are based on 256 responses by 16 unique participants.
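The two-sample t-test described above can be sketched as follows. The source does not say whether equal variances were assumed, so the Welch (unequal-variance) form used here is an assumption, and the score lists are invented purely for illustration. The sketch stops at the t statistic and degrees of freedom; turning them into a p-value requires a t-distribution CDF, which the Python standard library does not provide.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each sample (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (se2_a**2 / (n_a - 1) + se2_b**2 / (n_b - 1))
    return t_stat, dof


# Hypothetical 1-10 accuracy ratings for two products (not data from the study).
scores_a = [8, 9, 7, 8, 9]
scores_b = [6, 5, 7, 6, 6]
t_stat, dof = welch_t(scores_a, scores_b)
print(round(t_stat, 3), round(dof, 2))
```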

A new gold standard for explainability in AI-assisted research

The most citations

For the same query, on average, System cites 6x as many studies.

The most depth



On average, System is able to generate much longer syntheses while maintaining an unrivaled citation count.

The most breadth


System’s syntheses cover many more biomedical topics that are related to the search.

Methodology

A representative sample of 50 searches conducted by System Pro users between May and September 2023 was created. To compare System Pro with Commercial Product #1, we ran the same search query and recorded the resulting summary and citations. Searches were run in September 2023.

Commercial Product #2 does not directly synthesize search results, as it relies on a question to generate an answer. To make a direct comparison, we used the sections of System’s synthesis for a specific search query. For example, for the user query “SLE and b-cell depletion”, System Pro generated the following sections: “Overview”, “Role of B-cells in SLE”, “B-cell depletion therapies”, and “Efficacy of B-cell depletion in SLE”. We generated a question for each section using OpenAI's GPT-4 and asked Commercial Product #2 that question. In the example above, for the section called “B-cell depletion therapies”, GPT-4 generated the following question: “What are the different B-cell depletion therapies used in the treatment of SLE?” We then saved the resulting summary and articles. On average, it took 4.9 searches on Commercial Product #2 to generate a comparable summary.
