Spark errors
* Null value appeared in non-nullable field
java.lang.NullPointerException: Null value appeared in non-nullable field: top level row object
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).
Fix: add a filter on the DataFrame that drops any Row that is null (row == null).
df.filter(row -> row != null)
* Compilation problem: changes made to rows in map do not take effect:
ERROR CodeGenerator: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "processNext()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB
/* 001 */ public Object generate(Object[] references) {
/* 002 */ return new GeneratedIterator(references);
/* 003 */ }
...... (more than ten thousand lines of generated source omitted)
Cause: the Dataset contains fields whose schema type is null (for example rank), as shown below. This happens because the SQL used the null as rank syntax.
|-- is_merchant_exclusive: integer (nullable = true)
|-- comment_keywords: array (nullable = true)
| |-- element: string (containsNull = true)
|-- date: date (nullable = true)
|-- generate_time: null (nullable = true)
|-- rank: null (nullable = true)
|-- prime: null (nullable = true)
|-- activities: null (nullable = true)
|-- categories: null (nullable = true)
|-- total_heart_num: null (nullable = true)
|-- ad_categories: null (nullable = true)
Fix:
The schema used in the Dataset's map method must first redefine the null-typed fields listed above.
newFields.set(oldSchema.fieldIndex("rank"), staticSchema.apply("rank"));
...
* When Spark saves data to Hive: Caused by: parquet.io.ParquetEncodingException: empty fields are illegal, the field should be ommited completely instead
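Cause: the Hive table has map or array columns, and the data being saved contains empty array or empty map values.
Fix: change the empty array or empty map values (empty collections for short) to null; the save then succeeds.
Spark does not support empty collections when saving to Hive, so they have to be changed to null before saving. Loading data files into Hive directly does not have this problem.
As a result, data read from Hive with Spark can carry empty collections, and these need to be changed to null before saving.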
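The empty-to-null conversion described above can be sketched in plain Java. This is only a minimal illustration: the class name EmptyCollections and both method names are hypothetical, and the sketch assumes the array and map columns have already been read out of the row as java.util.List and java.util.Map values.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Spark or Hive.
final class EmptyCollections {
    private EmptyCollections() {}

    // Returns null for a null or empty list, otherwise the same list instance.
    static <T> List<T> emptyListToNull(List<T> list) {
        return (list == null || list.isEmpty()) ? null : list;
    }

    // Returns null for a null or empty map, otherwise the same map instance.
    static <K, V> Map<K, V> emptyMapToNull(Map<K, V> map) {
        return (map == null || map.isEmpty()) ? null : map;
    }
}
```

In a Spark job these helpers would be applied to each array or map column just before the write, for instance inside the map function. That wiring depends on the job and is not shown here.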
