# # java.lang.OutOfMemoryError: Java heap space # -XX:OnOutOfMemoryError="/usr/lib64/cmf/service/common/killparent.sh" # Executing /bin/sh -c "/usr/lib64/cmf/service/common/killparent.sh"...One of the possible reason for the failure was due to some large job history files under /user/spark/sparkApplicationHistory (or /user/spark/spark2ApplicationHistory for Spark2). On SparkHistory server startup, it will try to load those files into memory, and if those history files are too big, it will cause history server to crash with OutOfMemory error unless heap size is increased through Cloudera Manager interface. To confirm this, just run:
hdfs dfs -ls /user/spark/sparkApplicationHistoryor
hdfs dfs -ls /user/spark/spark2ApplicationHistoryand see if there are any files that are in hundreds of MBs in size, if yes, then that will be a problem. You might also notice that some files might have “.inprogress” extension, like below:
/user/spark/spark2ApplicationHistory/application_1503451614878_0337.inprogressThose files can become stale in the HDFS directory in the case that Spark job failed prematurely, and Spark did not get a chance to clean up those files. Since SparkHistory server has no way of knowing if those files were left over from a failed Spark job, or if they were still being processed, hence it will just blindly load everything under the HDFS directory into memory, which will cause failure if those files are too big. Spark has a clean up job to remove any old files that are longer than a pre-defined time period, however, it does not remove stale .inprogress files. This issue was reported in SPARK-8617. Once SPARK-8617 is fixed, we should not see those stale .inprogress files anymore. But at the time of writing, it has not been backported into CDH yet. For now, we just need to delete all the files under /user/spark/sparkApplicationHistory or /user/spark/spark2ApplicationHistory directory in HDFS that are older than, say one week, so that those big files can be cleaned up. After that, we should be able to get SparkHistory server startup without issues.