Add example for PartitionedFile schema #22809
//! (file: query_http_csv.rs, desc: Query CSV files via HTTP)
//!
//! - `remote_catalog`
//! (file: remote_catalog.rs, desc: Interact with a remote catalog)
Please add an entry for partitioned_file_schema
let table_schema = Arc::new(Schema::new(vec![
Field::new("a", DataType::Int32, true),
Field::new("b", DataType::Float64, true),
Please add a comment explaining the purpose of field `b`. It is not mentioned at https://github.com/apache/datafusion/pull/22809/changes#diff-5097924e81226127006feb2aab9ff70726bf3ad7d6bb5d6d73a7a53f0412636bR45
I added a comment for this field, let me know if it makes sense.
alamb
left a comment
Thank you @fpetkovski and @martin-g
I ran it locally like this:
andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ cargo run --profile=ci --example data_io -- partitioned_file_schema
Finished `ci` profile [unoptimized] target(s) in 0.25s
Running `target/ci/examples/data_io partitioned_file_schema`
RecordBatch { schema: Schema { fields: [Field { name: "a", data_type: Int32, nullable: true }, Field { name: "b", data_type: Float64, nullable: true }], metadata: {} }, columns: [PrimitiveArray<Int32>
[
1,
2,
3,
4,
5,
], PrimitiveArray<Float64>
[
null,
null,
null,
null,
null,
]], row_count: 5 }
RecordBatch { schema: Schema { fields: [Field { name: "a", data_type: Int32, nullable: true }, Field { name: "b", data_type: Float64, nullable: true }], metadata: {} }, columns: [PrimitiveArray<Int32>
[
1,
2,
3,
4,
5,
], PrimitiveArray<Float64>
[
null,
null,
null,
null,
null,
]], row_count: 5 }
Got schema error: ParquetError(ArrowError("Incompatible supplied Arrow schema: data type mismatch for field a: requested Int64 but found Int32"))
I took the liberty of pushing a commit to your branch to resolve a CI error: https://github.com/apache/datafusion/actions/runs/27138078890/job/80100749669?pr=22809
/// already known, it can be supplied up front so this inference step is
/// skipped, saving an I/O round trip and metadata parse per file.
///
/// The example writes a small Parquet file with a single `Int32` column `a` and
Thank you -- this is a nice description of what is going on
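For readers skimming the thread, the behaviour visible in the local run (column `a` read from the file, the nullable column `b` coming back as nulls, and an `Int64` request for `a` being rejected) can be modelled with a small standalone sketch. This is a toy model using only the standard library, not DataFusion's or the parquet crate's implementation: `ToyType`, `ColumnPlan`, and `plan_columns` are hypothetical names invented for illustration, and the rule for non-nullable missing fields is an assumption rather than something shown in the run.

```rust
use std::collections::HashMap;

/// Toy stand-in for Arrow data types (hypothetical, not the `arrow` crate enum).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ToyType {
    Int32,
    Int64,
    Float64,
}

/// How one requested column would be produced in this toy model.
#[derive(Debug, PartialEq)]
enum ColumnPlan {
    /// The column exists in the file with the requested type.
    ReadFromFile,
    /// The column is absent from the file and nullable, so it is all nulls.
    FillWithNulls,
}

/// Check a caller-supplied schema `(name, type, nullable)` against the
/// columns physically stored in the file `(name, type)`.
fn plan_columns(
    supplied: &[(&str, ToyType, bool)],
    file: &[(&str, ToyType)],
) -> Result<Vec<ColumnPlan>, String> {
    let file_types: HashMap<&str, ToyType> = file.iter().copied().collect();
    let mut plans = Vec::new();
    for &(name, requested, nullable) in supplied {
        match file_types.get(name) {
            Some(&found) if found == requested => plans.push(ColumnPlan::ReadFromFile),
            Some(&found) => {
                return Err(format!(
                    "data type mismatch for field {name}: requested {requested:?} but found {found:?}"
                ))
            }
            None if nullable => plans.push(ColumnPlan::FillWithNulls),
            // Assumption: a missing non-nullable field is treated as an error.
            None => return Err(format!("non-nullable field {name} is missing from the file")),
        }
    }
    Ok(plans)
}

fn main() {
    // The file written by the example holds a single Int32 column `a`.
    let file = [("a", ToyType::Int32)];

    // Supplied schema with an extra nullable Float64 column `b`.
    let ok = plan_columns(
        &[("a", ToyType::Int32, true), ("b", ToyType::Float64, true)],
        &file,
    );
    println!("{ok:?}");

    // Supplied schema that disagrees with the file about the type of `a`.
    let err = plan_columns(&[("a", ToyType::Int64, true)], &file);
    println!("{err:?}");
}
```

The sketch only classifies columns; it does no I/O, which mirrors the point of the doc comment that supplying the schema up front avoids a per-file inference step.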
Which issue does this PR close?
Addresses the suggestion in #22360 (review) to add an example for specifying an Arrow schema for a PartitionedFile.
What changes are included in this PR?
datafusion-examples/examples/data_io/partitioned_file_schema.rs.
Are these changes tested?
Tested with
cd datafusion-examples/examples
cargo run --example data_io -- partitioned_file_schema
Are there any user-facing changes?
No user facing changes.
cc @alamb