Part 11/13:
Analysis revealed that training-data composition explains much of the performance difference. GPT-3.5, trained on diverse sources including scientific literature, responded more reliably, while models such as Falcon or smaller fine-tuned variants sometimes produced inconsistent or irrelevant outputs.
Future Directions: Toward Smarter, Safer AI in Science
The team outlined several initiatives on its roadmap:
Integrating user-specific context (e.g., search history) for personalized results.
Developing researcher profiles and collaboration tools.
Expanding the dataset beyond the current five-year window to include 45 million articles.
Improving prompt strategies and fine-tuning models specifically for biomedical applications.
Strengthening feedback mechanisms for continuous system improvement.