PCA Clustering and Topic Modeling of New York City PPRs Demonstrate Typological Differences in Content and Success Rate of DOI Interventions in City Governance
This research investigates policy and procedural recommendations (PPRs) in New York City governance using PCA cluster analysis and natural language processing (NLP).
This research was presented at the 2025 Undergraduate Data Science Conference at Columbia University.
This project was spearheaded by Ishaan Barrett and Rohit Barrett.
Introduction
The project aims to show the typological differences between accepted and rejected policy and procedural recommendations (PPRs) by identifying sentiment and key topics within the recommendations through topic modeling. By analyzing these factors, it seeks to make procedural recommendations more effective, providing insights that could help policymakers and the Department of Investigation (DOI) mitigate corruption across hundreds of New York City agencies and departments. Ultimately, this work has the potential to improve public policy practices and strengthen democratic safeguards in the City.
PCA Clustering and Visualization
- PPR data was loaded and separated into accepted and rejected PPRs.
- Embeddings were generated concurrently for every recommendation in each group using the Gemini Embeddings model (semantic similarity).
- The dimensionality of the embeddings was reduced via Principal Component Analysis (PCA).
- PPRs were grouped with similar recommendations using K-Means clustering, with the optimal number of clusters determined by the Elbow Method (n=3 clusters for both the accepted and rejected groups).
- VADER and NLTK were used to perform sentiment analysis and to extract key topics within each cluster using dependency parsing.
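
The PCA, Elbow Method, and K-Means steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes scikit-learn and NumPy, and it substitutes a synthetic matrix for the Gemini embeddings. The function name `reduce_and_cluster`, its parameters, and the choice of two principal components are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def reduce_and_cluster(embeddings, n_components=2, k_range=range(1, 9), k=3, seed=0):
    """Reduce embeddings with PCA, collect elbow-curve inertias, then fit K-Means with k clusters."""
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(embeddings)

    # Elbow Method: record the within-cluster sum of squares (inertia) for each candidate k.
    # The "elbow" is the k after which inertia stops dropping sharply.
    inertias = {}
    for candidate in k_range:
        model = KMeans(n_clusters=candidate, n_init=10, random_state=seed).fit(reduced)
        inertias[candidate] = model.inertia_

    # Final clustering with the chosen k (3 in the write-up above).
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(reduced)
    return reduced, inertias, labels


# Synthetic stand-in for one group's embedding matrix (assumption):
# three well-separated blobs of 40 points each in 64 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 64))
embeddings = np.vstack([c + rng.normal(size=(40, 64)) for c in centers])

reduced, inertias, labels = reduce_and_cluster(embeddings)
```

In practice the same function would be run once on the accepted group and once on the rejected group, and `inertias` would be plotted against k to locate the elbow.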

Per-Cluster Grouped Topic Modeling
Both groups of PPRs displayed three defined clusters; however, because the accepted PPRs (Figure 1) had a larger sample size, clustering was more defined for that group than for the rejected PPRs (Figure 2). Topic modeling was completed for each cluster in each PPR outcome group:

Accepted PPRs:
- Cluster 0: topics relate to citywide fleet management, vehicles, fuel, city-level policies, and operations concerning transportation of resources.
- Cluster 1: topics relate to specific agencies like ACS, NYCHA, DOI, and NYPD and their staff and employees, internal agency operations, personnel, and oversight.
- Cluster 2: topics relate to financial processes, directives, citywide systems like FMS, PPRs, and credit cards, focusing on financial controls, audits, and policy adherence.
Rejected PPRs:
- Cluster 0: topics relate to policing, biased complaints, fleet management, city vehicles, credit cards, and departmental searches.
- Cluster 1: topics relate to city summonses, debt, contracts, vendor payments, legal cases, permanent exclusion, and staff reviews.
- Cluster 2: topics relate to fuel deliveries, court procedures, resident issues, employee relationships, and police investigations (specifically BWC and use of force).
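
As a rough illustration of how per-cluster topic lists like those above might be surfaced, the sketch below counts the most frequent non-stopword terms in each cluster. It is a simplified frequency-based stand-in, not the dependency-parsing approach the project used, and it relies only on the Python standard library. The function `top_terms_per_cluster`, the stopword list, and the sample texts are all hypothetical.

```python
import re
from collections import Counter, defaultdict

# Small illustrative stopword list (assumption; NLTK ships a fuller one).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "for", "in", "on",
             "should", "all", "its", "with", "by", "be"}


def top_terms_per_cluster(texts, labels, n_terms=3):
    """Return the n_terms most frequent non-stopword tokens for each cluster label."""
    counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        tokens = re.findall(r"[a-z]+", text.lower())
        counts[label].update(t for t in tokens if t not in STOPWORDS)
    return {
        label: [term for term, _ in counter.most_common(n_terms)]
        for label, counter in counts.items()
    }


# Made-up recommendation snippets and cluster labels, for illustration only.
texts = [
    "The agency should track fuel usage for all fleet vehicles.",
    "Fleet vehicles should log fuel deliveries.",
    "The agency should audit credit card purchases.",
    "Credit card audit reports should be reviewed.",
]
labels = [0, 0, 1, 1]

topics = top_terms_per_cluster(texts, labels, n_terms=3)
```

A frequency count like this only hints at themes; grouping terms into readable topic descriptions still takes manual interpretation.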
Conclusion
Accepted PPRs relate to administrative efficiency, financial control, and citywide system management, emphasizing technical improvements and procedural compliance. Rejected PPRs, by contrast, concentrate on politically sensitive or incident-driven issues such as policing, bias complaints, legal disputes, and personnel conflicts. Even where subject matter overlaps (fleet management and credit cards appear in both groups), accepted PPRs address neutral, citywide logistics, while rejected PPRs tie the same subjects to enforcement or accountability contexts. In effect, the clustering suggests a structural preference for process-oriented reforms over corrective or disciplinary interventions.