Note, however, that all of the IO-stream approaches above load the file into a buffer and then check every character for line breaks, so positioning can at best run at disk-read speed. Still, since the big-data nodes account for more than 90% of the XML document's space, the positioning step can be optimized (in the OP's problem the output is already bounded by disk speed, so there is no need to consider the output side here).
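For reference, the baseline positioning could look roughly like the following minimal Java sketch (the class and method names are my own, not the OP's code): scan characters one by one, counting rows and columns, until the target position is reached, and return its character offset.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class BaselineLocator {

    /**
     * Scan the file character by character, counting rows and columns
     * (both 1-based), until the target position is reached; return the
     * character offset of that position. Every character before the
     * target is read and checked, so the cost grows with the offset.
     */
    static long locate(String file, int targetRow, int targetCol) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
            long offset = 0;
            int row = 1, col = 1;
            while (!(row == targetRow && col == targetCol)) {
                int c = reader.read();
                if (c == -1) {
                    throw new IOException("Position not found: " + targetRow + ":" + targetCol);
                }
                offset++;
                if (c == '\n') { row++; col = 1; } else { col++; }
            }
            return offset;
        }
    }
}
```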
The idea is as follows:
1. When the XML contains only one big-data node, the offset of its start position is obviously small, so the time spent reading and checking characters up to that offset is tiny (milliseconds, negligible); we can simply start writing out the file from there until the end position is reached. In this case the positioning time is negligible and no optimization is needed.
2. When the XML contains multiple big-data nodes, outputting each one requires first locating its start (its row and column number) and then writing it out. As in case 1, the positioning time for the first node can be ignored, but for the later nodes it cannot, because the Reader is reopened on the file for each node and therefore re-reads and re-checks all of the data it has already scanned. That repeated scanning is what can be optimized. Three options:
1) Output the big-data nodes strictly in order. After each node is written out, record the reader's current offset, call it n; for the next node, call reader.skip(n) on a fresh reader to jump straight past the data that has already been scanned, then continue locating forward from there, and so on until the last node has been visited (see the first sketch after this list).
2) Scheme 1 has an obvious drawback: the nodes must be accessed strictly in order, otherwise nothing is saved. For example, with 4 big nodes accessed in the order 4, 2, 1, 3, the only access whose positioning can be shortened is the last one, node 3, and even then only the region already scanned for node 1 is skipped; the region scanned for node 2 is read and checked again, because scheme 1 only remembers the most recent offset. Since node 2 has already been visited, locating node 3 ought to start from node 2's end position. So scheme 2 records the end position of every visited node; for each new node it looks through all of the saved end positions, takes the nearest one at or before the node's start, and begins locating (and then outputting) from there (see the second sketch after this list). For the order 4, 2, 1, 3 this saves a certain amount of time. In the best case the positioning overhead is essentially 0; in the worst case every node is still located from the beginning of the file, for example the order 4, 3, 2, 1; but on average the overhead is smaller than with scheme 1.
3) For the cases that schemes 1 and 2 cannot locate quickly, such as 4, 3, 2, 1, a further optimization is possible: sort the start positions of all the big-data nodes to be processed in advance, scan the whole file once, and mark the character offset of each start position; after that, outputting any node only requires skipping straight to its marked offset (see the third sketch after this list). The total time overhead of this scheme is always slightly less than two full passes over the file: the positioning pass costs slightly less than one pass (close to 0 in the best case, close to a full pass in the worst), and the output adds up to slightly less than one more pass.
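A minimal Java sketch of scheme 1 (the class, the Checkpoint holder, and the assumption that each node's start row/column and its length in characters are already known from the earlier parsing step are mine, not the OP's): after writing one node, the position reached is returned so the next extraction can skip straight past it.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Writer;

public class SequentialExtractor {

    /** Reader position saved after the previous node has been written out. */
    static class Checkpoint {
        long offset;            // characters already consumed from the file
        int row = 1, col = 1;   // row/column that the offset corresponds to
    }

    /**
     * Locate the node starting at (targetRow, targetCol), write 'length'
     * characters of it to 'out', and return the position reached so that
     * the next (later) node can resume from there instead of offset 0.
     */
    static Checkpoint extract(String file, Checkpoint cp,
                              int targetRow, int targetCol,
                              long length, Writer out) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
            // Jump past everything scanned for earlier nodes; skip() may skip
            // fewer characters than requested, so loop until done.
            long toSkip = cp.offset;
            while (toSkip > 0) {
                long skipped = reader.skip(toSkip);
                if (skipped <= 0) throw new IOException("Unexpected end of file while skipping");
                toSkip -= skipped;
            }

            // Resume the row/column bookkeeping from the checkpoint onwards.
            long offset = cp.offset;
            int row = cp.row, col = cp.col;
            while (!(row == targetRow && col == targetCol)) {
                int c = reader.read();
                if (c == -1) throw new IOException("Position not found: " + targetRow + ":" + targetCol);
                offset++;
                if (c == '\n') { row++; col = 1; } else { col++; }
            }

            // Copy the node body, keeping the bookkeeping valid across it.
            char[] buf = new char[8192];
            long remaining = length;
            int n;
            while (remaining > 0
                    && (n = reader.read(buf, 0, (int) Math.min(buf.length, remaining))) != -1) {
                out.write(buf, 0, n);
                for (int i = 0; i < n; i++) {
                    if (buf[i] == '\n') { row++; col = 1; } else { col++; }
                }
                offset += n;
                remaining -= n;
            }

            Checkpoint next = new Checkpoint();
            next.offset = offset;
            next.row = row;
            next.col = col;
            return next;
        }
    }
}
```

Calling extract once per node in document order, feeding each returned Checkpoint into the next call, gives the sequential behaviour described in scheme 1.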
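Scheme 2 only adds bookkeeping on top of that: keep every end position recorded so far and, before locating a node, pick the nearest one at or before its start. A sketch with hypothetical names, using a TreeSet ordered by (row, column):

```java
import java.util.TreeSet;

public class CheckpointIndex {

    /** A reader position: character offset plus the row/column it corresponds to. */
    static class Position implements Comparable<Position> {
        final long offset;
        final int row, col;
        Position(long offset, int row, int col) { this.offset = offset; this.row = row; this.col = col; }
        @Override public int compareTo(Position o) {
            return row != o.row ? Integer.compare(row, o.row) : Integer.compare(col, o.col);
        }
    }

    // End positions recorded after every node output so far, ordered by
    // (row, column); the start of the file is always available as a fallback.
    private final TreeSet<Position> visitedEnds = new TreeSet<>();

    CheckpointIndex() {
        visitedEnds.add(new Position(0, 1, 1));
    }

    /** Record the position reached after a node has been written out. */
    void record(long offset, int row, int col) {
        visitedEnds.add(new Position(offset, row, col));
    }

    /**
     * For a node starting at (row, col), return the nearest recorded end
     * position at or before it; locating only needs to scan (and count
     * line breaks) from there instead of from the beginning of the file.
     */
    Position nearestBefore(int row, int col) {
        return visitedEnds.floor(new Position(0, row, col));
    }
}
```

The Position returned by nearestBefore plays the role of the Checkpoint in the previous sketch: skip to its offset and resume counting rows and columns from there.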
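Scheme 3 trades one up-front pass for order-independent access: sort the start positions of the nodes to be processed, scan the file once, and remember the character offset of each start. A sketch under the same assumptions:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class PreScanIndexer {

    /**
     * One pass over the file: for each requested start position, given as
     * {row, col} pairs sorted in document order, record the character offset
     * at which it begins. The scan stops as soon as the last position is
     * marked, so it costs slightly less than one full read of the file.
     */
    static long[] buildIndex(String file, int[][] startsInDocumentOrder) throws IOException {
        long[] offsets = new long[startsInDocumentOrder.length];
        try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
            long offset = 0;
            int row = 1, col = 1;
            int next = 0;                         // next start position to mark
            while (next < startsInDocumentOrder.length) {
                if (row == startsInDocumentOrder[next][0]
                        && col == startsInDocumentOrder[next][1]) {
                    offsets[next++] = offset;     // mark where this node begins
                    continue;                     // handles back-to-back start positions
                }
                int c = reader.read();
                if (c == -1) {
                    throw new IOException("Start position not found: "
                            + startsInDocumentOrder[next][0] + ":" + startsInDocumentOrder[next][1]);
                }
                offset++;
                if (c == '\n') { row++; col = 1; } else { col++; }
            }
        }
        return offsets;
    }
}
```

Each node is then output by opening a reader, skipping to its recorded offset, and copying until its end, so the positioning cost no longer depends on the access order.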