Safe and Explainable RL


  1. Returaj Burnwal
  2. Anirban Santara
  3. Nirav P. Bhatt
  4. Balaraman Ravindran
  5. Gaurav Aggarwal


Responsible AI Adherence of Language Models

Major factors affecting trustworthiness of Language Models

  • Performance
  • Explainability
  • Fairness and Bias
  • Privacy & Security

We intend to explore and utilize the Performance and Fairness & Bias aspects of Language Models to quantify their adherence to Responsible AI (RAI).

Motivation: Why Safe and Explainable RL?

  • Safety: In real-world applications such as surgical robotics or self-driving cars, deploying models without safety considerations can lead to undesirable consequences. Safe RL aims to ensure that learned policies adhere to specified constraints, preventing harmful actions.

  • Explainability: Understanding the decisions made by RL models is crucial for users, stakeholders, and regulators. Explainable RL helps build trust in the system by providing clear insights into why a particular decision or action was taken.

  • Legal and Ethical Compliance: Many industries are subject to regulations and ethical standards. Safe and explainable RL helps in complying with these requirements, avoiding legal issues and ensuring responsible AI deployment.

Notion of Safety

  • Incorporating safety constraints in the policy model.
  • Guidance from a safe expert policy.
  • In offline RL, insufficient information about specific parts of the state space raises deployment risks, because the safety status of those states remains uncertain.
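
The three notions above can be combined in a single action-selection rule. The following is a minimal illustrative sketch, not the authors' method: it assumes a discrete action set, per-action value and cost estimates for the current state, and per-action sample counts from the offline dataset. The function name, the parameters, and the thresholds `cost_limit` and `min_visits` are all hypothetical.

```python
from typing import Sequence


def select_safe_action(
    q_values: Sequence[float],
    cost_estimates: Sequence[float],
    visit_counts: Sequence[int],
    expert_action: int,
    cost_limit: float = 0.1,   # hypothetical safety budget per action
    min_visits: int = 5,       # hypothetical coverage threshold
) -> int:
    """Pick the highest-value action among those considered safe.

    An action is allowed only if (a) its estimated cost satisfies the
    safety constraint and (b) the offline data covers it well enough for
    that estimate to be trusted. If no action qualifies, defer to the
    action suggested by the safe expert policy.
    """
    allowed = [
        a
        for a in range(len(q_values))
        if cost_estimates[a] <= cost_limit and visit_counts[a] >= min_visits
    ]
    if not allowed:
        # Nothing is verifiably safe here: fall back to expert guidance.
        return expert_action
    return max(allowed, key=lambda a: q_values[a])


# Action 2 has the highest value but violates the cost limit,
# so the best remaining action (1) is chosen.
print(select_safe_action([1.0, 2.0, 3.0], [0.05, 0.05, 0.5], [10, 10, 10], expert_action=0))
```

Treating poorly covered actions as disallowed is one conservative way to handle the uncertainty noted in the last bullet; other designs could instead penalize the value estimate in proportion to that uncertainty.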


  1. Safe Transfer learning among different dynamical agents: arXiv