Over the last few weeks, I worked on an issue where the Cloudera Manager (CM) server itself was very slow to respond to normal API calls, such as CM API logins and BDR (Backup and Disaster Recovery) jobs. After weeks of troubleshooting, I finally found the cause of the slowness.
The cause was the large number of rows in the AUDITS table. In my case the table held more than 4 million rows, and CM queries it every time a user logs in, to retrieve the user's last login information. Because the table has no index for this query and holds so much data, the query took up to 20 seconds. That made CM slow to respond to user logins, which in turn slowed down all other requests as well.
To confirm whether this is the case, use the shell script below to capture jstack dumps from CM while it is slow:
    for i in $(seq 1 10)
    do
        echo "writing to: /tmp/jstack_cm_server_$(date +%Y%m%d)_$i.out"
        sudo -u cloudera-scm jstack -l $(pgrep -f cloudera-scm-server) > /tmp/jstack_cm_server_$(date +%Y%m%d)_$i.out
        sleep 15
    done
Then check whether the dumps contain a thread like the one below:
    "[email protected]" #32177 daemon prio=5 os_prio=0 tid=0x00007fe620012800 nid=0x6747 runnable [0x00007fe3ae832000]
       java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        ....
        at org.hibernate.loader.hql.QueryLoader.list(QueryLoader.java:490)
        at org.hibernate.hql.internal.ast.QueryTranslatorImpl.list(QueryTranslatorImpl.java:355)
        at org.hibernate.engine.query.spi.HQLQueryPlan.performList(HQLQueryPlan.java:195)
        at org.hibernate.internal.SessionImpl.list(SessionImpl.java:1269)
        at org.hibernate.internal.QueryImpl.list(QueryImpl.java:101)
        at org.hibernate.ejb.QueryImpl.getResultList(QueryImpl.java:264)
        at com.cloudera.cmf.persist.DbAuditDao.getAudits(DbAuditDao.java:538)
        at com.cloudera.server.web.cmf.CMFUserDetailsService.getLastNLogins(CMFUserDetailsService.java:371)
        at com.cloudera.server.web.cmf.CMFUserDetailsService.loadUserByUsername(CMFUserDetailsService.java:247)
        at org.springframework.security.authentication.dao.DaoAuthenticationProvider.retrieveUser(DaoAuthenticationProvider.java:81)
If you see a thread like this, it confirms that the slowness is caused by the AUDITS table: the thread is stuck in the getLastNLogins call, waiting for the database query to return.
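Checking ten dump files by hand gets tedious, so a small helper along these lines could count how many of them contain the getLastNLogins frame. This is only a sketch: the function name count_audit_stalls is made up for illustration, and it assumes the dumps were written to the paths used by the capture script.

```shell
#!/usr/bin/env bash
# Count how many jstack dumps contain the getLastNLogins frame.
# count_audit_stalls is a hypothetical helper name, not part of CM.
# Arguments: the jstack output files to scan.
count_audit_stalls() {
    local pattern='CMFUserDetailsService.getLastNLogins'
    local hits=0 f
    for f in "$@"; do
        # -F treats the pattern as a fixed string; -q only sets the exit status
        if grep -qF "$pattern" "$f" 2>/dev/null; then
            hits=$((hits + 1))
        fi
    done
    echo "$hits of $# dumps contain $pattern"
}

# Example (assumes the dumps captured earlier are still in /tmp):
# count_audit_stalls /tmp/jstack_cm_server_*.out
```

If the frame shows up in most of the dumps taken 15 seconds apart, the login path is very likely where CM is spending its time.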
To fix the issue:
- Clean up the AUDITS table. CM 6.2 has a feature that merges multiple entries in the AUDITS table into one, based on a configurable time range, to reduce the number of rows. However, it only applies to new records, so older entries still need to be cleaned up.
- Add an index to the AUDITS table on the columns below:
ACTING_USER_ID, AUDIT_TYPE, ALLOWED, CREATED_INSTANT
Please note that you will need to remove this index when you next upgrade CM, because an unexpected index can cause the upgrade to fail.
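As a rough sketch of the index step, the helper below only prints the SQL so it can be reviewed before being run against the CM database. Several things here are assumptions: the index name idx_audits_last_login and the function name audits_index_sql are made up for illustration, the column order simply follows the list above, and the DROP syntax covers only MySQL and PostgreSQL. Adjust identifier case and syntax for your database.

```shell
#!/usr/bin/env bash
# Print the CREATE or DROP statement for the extra AUDITS index.
# idx_audits_last_login is a hypothetical name, not one that CM defines.
# Usage: audits_index_sql create|drop [mysql|postgresql]
audits_index_sql() {
    local action="$1" db="${2:-mysql}"
    local name='idx_audits_last_login'
    case "$action" in
        create)
            echo "CREATE INDEX $name ON AUDITS (ACTING_USER_ID, AUDIT_TYPE, ALLOWED, CREATED_INSTANT);"
            ;;
        drop)
            # MySQL requires the table name in DROP INDEX; PostgreSQL does not take it
            if [ "$db" = "mysql" ]; then
                echo "DROP INDEX $name ON AUDITS;"
            else
                echo "DROP INDEX $name;"
            fi
            ;;
        *)
            echo "usage: audits_index_sql create|drop [mysql|postgresql]" >&2
            return 1
            ;;
    esac
}
```

Keeping the DROP statement next to the CREATE makes it easier to remember to remove the index again at upgrade time.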