hdfsOpenFile(hdfs://nameservice1/user/hive/warehouse/default.db/table/_impala_insert_staging/f44e0332a3ec1af9_55c692eb00000000/.dh4e0332q3ac1af9-55c692wb00000003_1471427586_dir/dh4e0332q3ac1af9-55c692wb00000003_1471427586_data.0.parq): FileSystem#create((Lorg/apache/hadoop/fs/Path;ZISJ)Lorg/apache/hadoop/fs/FSDataOutputStream;) error: RemoteException: Specified block size is less than configured minimum value (dfs.namenode.fs-limits.min-block-size): -1130396776 < 1048576
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2705)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2658)

The reason is that when a Parquet table has 10K+ columns, Impala tries to estimate the memory required to process the data, and the estimate overflows an int32 variable in the Impala code. The result wraps around to a negative value, which is then passed to HDFS as the block size. The NameNode rejects it because it is below dfs.namenode.fs-limits.min-block-size, which produces the error shown above (-1130396776 < 1048576).

This has been reported upstream in JIRA: IMPALA-7044. There is no workaround at this stage other than reducing the number of columns in the Parquet table. Currently the maximum number of columns Impala can handle is around 8K-10K, depending on the column types, so the table has to be redesigned to use fewer columns. I hope the above information is helpful.
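To illustrate the kind of overflow described above, here is a minimal Python sketch. It is not Impala's code: the helper wrap_int32 and the value 3164570520 are assumptions made for illustration. 3164570520 is simply the smallest non-negative number that wraps to the -1130396776 seen in the error; the real pre-overflow estimate is not known from the log.

```python
def wrap_int32(value: int) -> int:
    """Reinterpret an integer as a signed 32-bit two's-complement value."""
    return (value + 2**31) % 2**32 - 2**31


# Minimum block size taken from the error message
# (dfs.namenode.fs-limits.min-block-size).
MIN_BLOCK_SIZE = 1048576

# Hypothetical size estimate in bytes (an assumption, see the note above).
# It is larger than the int32 maximum of 2147483647, so it cannot be stored.
intended = 3164570520

wrapped = wrap_int32(intended)
print(wrapped)                   # the value after wrapping into int32
print(wrapped < MIN_BLOCK_SIZE)  # a negative size fails the NameNode's minimum check
```

Any estimate above 2147483647 bytes (about 2 GiB) wraps this way when stored in an int32, which is why the failure only appears once the table gets wide enough.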
If you have 10K columns, something is seriously wrong with that design…
It really depends. Columnar tables are designed to hold a large number of columns, and depending on the use case, I am not surprised that some tables need 10K+ columns.